Abstract —There are only a few learning algorithms applicable
to stochastic dynamic teams and games which generalize Markov 
decision processes to decentralized stochastic control problems 
involving possibly self-interested decision makers. Learning in 
games is generally difficult because of the non-stationary environment
in which each decision maker aims to learn its optimal
decisions with minimal information in the presence of the other 
decision makers who are also learning. In stochastic dynamic 
games, learning is more challenging because, while learning, the 
decision makers alter the state of the system and hence the 
future cost. In this paper, we present decentralized Q-learning 
algorithms for stochastic games, and study their convergence for 
the weakly acyclic case which includes team problems as an 
important special case. The algorithm is decentralized in that 
each decision maker has access to only its local information, the 
state information, and the local cost realizations; furthermore, it 
is completely oblivious to the presence of other decision makers. 
We show that these algorithms converge to equilibrium policies 
almost surely in large classes of stochastic games. 


I. Introduction 

This paper aims at developing new learning algorithms 
with desirable convergence properties for certain classes of 
stochastic games, which are discrete-time dynamic games in 
which the history can be summarized by a “state” [1]. More
specifically, we focus on weakly acyclic stochastic games that 
can be used to model cooperative systems. The chief merit of 
the paper lies in the fact that learning takes place in stochastic 
games, which are truly dynamic games, as opposed to learning 
in repeated games in which the same single-stage game is 
played in every stage. In stochastic games, the policies selected 
by the decision makers not only impact their immediate cost 
but also alter the stage-games to be played in the future 
through the state dynamics. Hence, our results are applicable 
to a significantly broader set of applications. 

The existing literature on learning in stochastic games is 
very small in comparison with the literature on learning in 
repeated games. As the method of reinforcement learning 
gained popularity in the context of Markov decision problems, 
a surge of interest in generalizing the method of reinforcement 
learning, in particular the Q-learning algorithm [2], to stochastic
games has led to a set of publications primarily in the computer
science literature; see [3] and the references therein. In many
of these publications, the authors tend to assume that the real
objective of the agents¹ is for some reason to find and play an
equilibrium strategy (and sometimes this even requires agents 
to somehow agree on a particular equilibrium strategy), and 
not necessarily to pursue their own objectives. Another serious 
issue is that the multi-agent algorithms introduced in many of 
these recent papers are not scalable since each agent needs to 
maintain estimates of its Q-factors for each state/joint action 
pair and compute an equilibrium at each step of the algorithm 
using the updated estimates, assuming that the actions and 
objectives are exchanged between all agents. 

Standard Q-learning, which enables an agent to learn how 
to play optimally in a single-agent environment, has also been 
applied to very specific multi-agent applications [4], [5]. Here,
each agent runs a standard Q-learning algorithm by ignoring 
the other agents, and hence information exchange between 
agents and the computational burden on each agent are substantially
lower than in the aforementioned multi-agent extensions of the
Q-learning algorithm. Also, standard Q-learning in a multi-agent
environment makes sense from the viewpoint of individual bounded
rationality. However, no analytical results exist regarding
the properties of standard Q-learning in a stochastic game 
setting. 

We should also mention several attempts to extend a well-known
learning algorithm called Fictitious Play (FP) [6], [7]
to stochastic games [8], [9], [10]. The joint action learning
algorithm presented in 0 quickly becomes computationally
prohibitive as the number of agents, states, or actions grows.
The algorithms presented in 0 are claimed to be convergent 
to an equilibrium in single-state single-stage common interest 
games but without a proof. The extension of FP considered in 
0 requires each agent to calculate a stationary policy at each 
step in response to the empirical frequencies of the stationary 
policies calculated and announced by other agents in the past. 
The main contribution of 0 is to show that such FP algorithm 
is not convergent even in the simplest 2x2x2 stochastic game 
where there are two states and two agents with two moves for 
each agent. The version of FP used in [10] is applicable only
to zero-sum games (strictly adversarial games). 

Other related work includes [11], [12], [13]. In [11], a multi-agent
version of an actor-critic algorithm [14] is shown to
be convergent to generalized equilibria in a weak sense of
convergence, whereas in [12] a policy iteration algorithm is
presented without rigorous results for stochastic games. The
algorithms given in El, El are rational from an individual
agent's perspective; however, they require a higher level of data
storage and processing than standard Q-learning. The paper

1 The terms “agent” and “decision maker” are used interchangeably.
[13] uses the policy iteration algorithm given in [12] in
conjunction with certain approximation methods to deal with 
a large state-space in a specific card-game without rigorous 
results. 

We should emphasize that our viewpoint is individual 
bounded rationality and strategic decision making, that is, 
agents should act to pursue their own objectives even in 
the short term using localized information and reasonable 
algorithms. It is also desired that agent strategies converge 
to an agreeable solution in cooperative situations where agent 
objectives are aligned with system designer’s objective even 
though agents do not necessarily strive for converging to a 
particular strategy. 

The rest of the paper is organized as follows. In §II, the
model is introduced. In §III, the specifics of the learning
paradigm and the standard Q-learning algorithm are discussed,
followed by the presentation of our first Q-learning algorithm
for stochastic games and its convergence properties. Generalizations
of our main results in §III are presented in §IV. This is
followed by a simulation study in §V. The paper is concluded
with some final remarks in §VI. Appendices contain the proofs
of the technical results in the paper.

II. Stochastic Dynamic Games 

Consider the (discrete-time) networked control system illustrated
in Figure 1, where $x_t$ is the state of the system at time
$t$, $u_t^i$ is the input generated by controller $i$ at time $t$, and $w_t$
is the random disturbance input at time $t$. Suppose that each



Fig. 1. A networked control system. 


controller $i$ is an autonomous decision maker (DM) interested
in minimizing its own long-term cost

$$E\Big[\sum_{t \ge 0} c^i(x_t, u_t^1, \dots, u_t^N)\Big]$$

where $c^i(x_t, u_t^1, \dots, u_t^N)$ is the cost incurred by controller $i$ at
time $t$, and $E[\cdot]$ denotes the expectation given a collection of
control policies (which will be specified later in the paper) on
a probability space $(\Omega, \mathcal{F}, P)$. Although controller $i$ can only
choose its own decisions $u_0^i, u_1^i, \dots$, its cost generally depends
on the decisions of all controllers through its single-stage cost
as well as the state dynamics. This dynamic coupling between
self-interested DMs with long-term objectives naturally leads
to the framework of stochastic games [1], which generalize
Markov decision problems.


Over the past half-century, there have been many applications
of stochastic games to control problems; see Chapter XIV
in [15] as an early reference. At the present time,
the control theory literature includes a large number of papers
employing the theory of stochastic games and their
continuous-time counterparts called “differential games” [16].
Many papers in this body of work study a zero-sum game
between a controller which aims to optimize the system
performance and an adversary which controls certain system
parameters and inputs to make the system performance as poor
as possible. We selectively cite [17] for robust control and
minimax estimation problems, [18] for flow control in queueing
networks, [19] for control of hybrid systems, and [20] for
robustness, security, and resilience of cyber-physical control
systems. The case of nonzero-sum games in which the decision
makers do not always have diametrically opposed objectives
has also received significant attention; see for example [21] on
admission, service, and routing control in queueing systems,
[22] on transmission control in cognitive radio systems, [23]
on network security, and [24] on formation control.

We should also mention the work on team decision problems 
where all DMs share a common long-term objective albeit with 
access to different information variables; see e.g., [25], [26]. In
this paper, differently from the usual team decision problems
in the literature, even though each DM has access to the state
information, it does not have access to global information
on the other DMs, or even on their presence. We also note
that the emergence of distributed control systems requires 
the formulation of “team problems” within a game-theoretic 
framework where local controllers are tasked to achieve one 
system level objective without centralized coordination; see for 
example [27] on distributed model predictive control. This type
of team problems and its generalizations where the objectives 
of DMs are aligned in some sense with a team objective are 
the primary focus of our work though the class of games 
considered in this paper is more general and it even includes 
some zero-sum stochastic games. 

A. Discounted Stochastic Dynamic Games 

A (finite) discounted stochastic game has the following
ingredients; see [1].

• A finite set of DMs, with the $i$-th DM referred to as DM$^i$
for $i \in \{1, \dots, N\}$

• a finite set $\mathbb{X}$ of states

• a finite set $\mathbb{U}^i$ of control decisions for each DM$^i$

• a cost function $c^i$ for each DM$^i$ determining DM$^i$'s cost
$c^i(x, u^1, \dots, u^N)$ at each state $x \in \mathbb{X}$ and for each joint
decision $(u^1, \dots, u^N) \in \mathbb{U}^1 \times \cdots \times \mathbb{U}^N$

• a discount factor $\beta^i \in (0,1)$ for each DM$^i$

• a random initial state $x_0 \in \mathbb{X}$

• a transition kernel for the probability $P[x' \mid x, u^1, \dots, u^N]$
of each state transition from $x \in \mathbb{X}$ to $x' \in \mathbb{X}$ for each
joint decision $(u^1, \dots, u^N) \in \mathbb{U}^1 \times \cdots \times \mathbb{U}^N$.

Such a stochastic game induces a discrete-time controlled
Markov process where the state at time $t$ is denoted by $x_t \in \mathbb{X}$,
starting with the initial state $x_0$. At any time $t \ge 0$, each
DM$^i$ makes a control decision $u_t^i \in \mathbb{U}^i$ (possibly randomly)
based on the available information. The state $x_t$ and the
joint decisions $(u_t^1, \dots, u_t^N)$ together determine each DM$^i$'s
cost $c^i(x_t, u_t^1, \dots, u_t^N)$ at time $t$ as well as the probability
distribution $P[\,\cdot \mid x_t, u_t^1, \dots, u_t^N]$ with which the next state
$x_{t+1}$ is selected.
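The ingredients listed above can be collected in a small container, which may help when experimenting with the model. The following is only a minimal sketch: the class `StochasticGame`, its field names, and the two-state team game in `toy_team_game` are hypothetical illustrations and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointDecision = Tuple[int, ...]


@dataclass
class StochasticGame:
    states: List[int]                                   # finite state set
    decisions: List[List[int]]                          # decisions[i]: DM i's decision set
    cost: Callable[[int, int, JointDecision], float]    # cost(i, x, u) for DM i
    betas: List[float]                                  # discount factor of each DM
    kernel: Callable[[int, JointDecision], Dict[int, float]]  # P[. | x, u]

    def step(self, x: int, u: JointDecision, rng: random.Random):
        """Return every DM's stage cost and a sampled next state."""
        costs = [self.cost(i, x, u) for i in range(len(self.decisions))]
        dist = self.kernel(x, u)
        x_next = rng.choices(list(dist), weights=list(dist.values()))[0]
        return costs, x_next


def toy_team_game() -> StochasticGame:
    """Hypothetical two-state, two-DM team game (identical costs)."""
    def cost(i, x, u):
        # Zero cost only when both DMs pick the decision matching the state.
        return 0.0 if u[0] == u[1] == x else 1.0

    def kernel(x, u):
        # Agreement steers the state toward the agreed decision.
        if u[0] == u[1]:
            return {u[0]: 0.9, 1 - u[0]: 0.1}
        return {0: 0.5, 1: 0.5}

    return StochasticGame([0, 1], [[0, 1], [0, 1]], cost, [0.9, 0.9], kernel)
```

Calling `toy_team_game().step(0, (0, 0), random.Random(0))` returns the two stage costs together with a next state drawn from the kernel.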

A policy for a DM is a rule of choosing an appropriate
control decision at any time based on the DM’s history of
observations. We will focus on stationary policies of the form
where a DM’s decision at time $t$ is determined solely based
on the state $x_t$. Such policies for each DM$^i$ are identified
by mappings from the state space $\mathbb{X}$ to the set $\mathcal{P}(\mathbb{U}^i)$ of
probability distributions on $\mathbb{U}^i$. The interpretation is that a
DM$^i$ using such a policy $\pi^i : \mathbb{X} \to \mathcal{P}(\mathbb{U}^i)$ makes its decision
$u_t^i$ at any time $t$ by choosing randomly from $\mathbb{U}^i$ according
to $\pi^i(x_t)$. We will denote the set of such policies by $\Delta^i$ for
each DM$^i$. We will primarily be interested in deterministic
(stationary) policies² denoted by $\Pi^i$ for each DM$^i$, where each
policy $\pi^i \in \Pi^i$ is identified by a mapping from $\mathbb{X}$ to $\mathbb{U}^i$.

The objective of each DM$^i$ is to find a policy $\pi^i \in \Delta^i$ that
minimizes its expected discounted cost

$$J_x^i(\pi^1, \dots, \pi^N) = E_x\Big[\sum_{t \ge 0} (\beta^i)^t c^i(x_t, u_t^1, \dots, u_t^N)\Big] \qquad (1)$$

for all $x \in \mathbb{X}$, where $E_x$ denotes the conditional expectation
given $x_0 = x$. Since DMs have possibly different cost
functions and each DM’s cost may depend on the control
decisions of the other DMs, we adopt the notion of equilibrium
to represent those policies that are person-by-person optimal.
For ease of notation, we denote the policies of all DMs
other than DM$^i$ by $\pi^{-i}$. For future reference, we also define
$\Pi^{-i} := \times_{j \ne i} \Pi^j$ and $\Delta^{-i} := \times_{j \ne i} \Delta^j$ as well as $\Pi := \times_j \Pi^j$
and $\Delta := \times_j \Delta^j$. Using this notation, we write a joint policy
$(\pi^1, \dots, \pi^N)$ as $(\pi^i, \pi^{-i})$ and $J_x^i(\pi^1, \dots, \pi^N)$ as $J_x^i(\pi^i, \pi^{-i})$.
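For a deterministic joint policy, the discounted cost in (1) satisfies the fixed-point relation $J_x^i = c^i(x, \pi(x)) + \beta^i \sum_{x'} P[x' \mid x, \pi(x)] J_{x'}^i$, so it can be approximated by repeated substitution. The sketch below illustrates this under stated assumptions: the function name `evaluate_cost`, its argument layout, and the stopping tolerance are all hypothetical choices, and policies are represented as dictionaries from states to decisions.

```python
def evaluate_cost(states, cost_i, kernel, beta, joint_policy, tol=1e-10):
    """Approximate DM i's discounted cost J_x^i for every initial state x.

    cost_i(x, u) is DM i's stage cost and kernel(x, u) returns a dict
    {next_state: probability}; joint_policy is a list of dicts, one per DM,
    mapping each state to that DM's decision.
    """
    J = {x: 0.0 for x in states}
    while True:
        new = {}
        for x in states:
            u = tuple(pi[x] for pi in joint_policy)
            expected_next = sum(p * J[y] for y, p in kernel(x, u).items())
            new[x] = cost_i(x, u) + beta * expected_next
        # The map is a contraction with modulus beta, so this loop terminates.
        if max(abs(new[x] - J[x]) for x in states) < tol:
            return new
        J = new
```

As a sanity check, a single-state game with unit stage cost and discount factor 1/2 has cost $1/(1 - 1/2) = 2$.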

Definition 1: A joint policy $(\pi^{*1}, \dots, \pi^{*N}) \in \Delta$ constitutes
a (Markov perfect) equilibrium if, for all $i$, $x$,

$$J_x^i(\pi^{*i}, \pi^{*-i}) = \min_{\pi^i \in \Delta^i} J_x^i(\pi^i, \pi^{*-i}).$$

It is known that any finite discounted stochastic game possesses
an equilibrium policy as defined above [28].

Although the minimum above can always be achieved by
a deterministic policy in $\Pi^i$ (since each DM$^i$’s problem is
a stationary Markov decision problem when the policies of
the other DMs are fixed at $\pi^{*-i}$), a deterministic equilibrium
policy may not exist in general. However, many interesting
classes of games do possess equilibrium in deterministic
policies. In particular, large classes of games arising from
applications where all DMs benefit from cooperation possess
equilibrium in deterministic policies. The primary examples
of such games of cooperation are team problems where all
DMs have the same cost function. In team problems, the
deterministic policies minimizing the common cost function
are clearly equilibrium policies although non-optimal deterministic
equilibrium policies may also exist. A more general
set of games of cooperation are those in which some function,
called the potential function, decreases whenever a single DM
decreases its own cost by unilaterally switching from one
deterministic policy to another one. In this class of games,
the deterministic policies minimizing the potential function are
equilibrium policies. As such, we are primarily interested in
the set of deterministic equilibrium policies denoted by $\Pi_{\mathrm{eq}}$,
where $\Pi_{\mathrm{eq}} \subset \Pi$.

2 When it is not clear from the context, a “policy” will mean a deterministic
policy.

We next formally introduce the set of games considered in 
this paper. 

B. Weakly Acyclic Games 

Let $\Pi^i_{\pi^{-i}}$ denote DM$^i$’s set of (deterministic) best replies
to any $\pi^{-i} \in \Delta^{-i}$, i.e.,

$$\Pi^i_{\pi^{-i}} := \{\hat{\pi}^i \in \Pi^i : J_x^i(\hat{\pi}^i, \pi^{-i}) = \min_{\pi^i \in \Delta^i} J_x^i(\pi^i, \pi^{-i}), \text{ for all } x\}.$$

DM$^i$’s best replies to any $\pi^{-i} \in \Delta^{-i}$ can be characterized by
its optimal Q-factors $Q^i_{\pi^{-i}}$ satisfying the fixed point equation

$$Q^i_{\pi^{-i}}(x, u^i) = E_{\pi^{-i}(x)}\Big[c^i(x, u^i, u^{-i}) + \beta^i \sum_{x' \in \mathbb{X}} P[x' \mid x, u^i, u^{-i}] \min_{v^i \in \mathbb{U}^i} Q^i_{\pi^{-i}}(x', v^i)\Big] \qquad (2)$$

for all $x, u^i$, where $E_{\pi^{-i}(x)}$ denotes the expectation with
respect to the joint distribution of $u^{-i}$ given by $\pi^{-i}(x) =
\pi^1(x) \times \cdots \times \pi^{i-1}(x) \times \pi^{i+1}(x) \times \cdots \times \pi^N(x)$. The optimal
Q-factor $Q^i_{\pi^{-i}}(x, u^i)$ represents DM$^i$’s expected discounted
cost to go from the initial state $x$ assuming that DM$^i$ initially
chooses $u^i$ and uses an optimal policy thereafter while the
other DMs use $\pi^{-i}$. One can then write $\Pi^i_{\pi^{-i}}$ as

$$\Pi^i_{\pi^{-i}} = \{\hat{\pi}^i \in \Pi^i : Q^i_{\pi^{-i}}(x, \hat{\pi}^i(x)) = \min_{v^i \in \mathbb{U}^i} Q^i_{\pi^{-i}}(x, v^i), \text{ for all } x\}.$$

The set of (deterministic) joint best replies is denoted by
$\Pi_\pi := \Pi^1_{\pi^{-1}} \times \cdots \times \Pi^N_{\pi^{-N}}$. Any best reply $\hat{\pi}^i \in \Pi^i_{\pi^{-i}}$ of
DM$^i$ is called a strict best reply with respect to $(\pi^i, \pi^{-i})$ if

$$J_x^i(\hat{\pi}^i, \pi^{-i}) < J_x^i(\pi^i, \pi^{-i}), \quad \text{for some } x.$$

Such a strict best reply $\hat{\pi}^i$ achieves DM$^i$’s minimum cost given
$\pi^{-i}$ for all initial states and results in a strict improvement over
$\pi^i$ for at least one initial state.
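When the transition kernel and the other DMs' policies are known, the fixed point equation (2) can be solved numerically by value iteration, and the best replies can then be read off state by state. The following sketch is illustrative only: the names `optimal_q_factors` and `best_reply_decisions`, the callable signatures, and the tolerances are assumptions made for this example.

```python
def optimal_q_factors(states, decisions_i, cost_i, kernel, beta, others, tol=1e-10):
    """Value iteration on the fixed point equation (2) for one DM.

    others(x) returns {u_minus_i: probability}, the other DMs' joint decision
    distribution at state x; cost_i(x, u_i, u_minus_i) is the stage cost and
    kernel(x, u_i, u_minus_i) returns {next_state: probability}.
    """
    Q = {(x, u): 0.0 for x in states for u in decisions_i}
    while True:
        V = {x: min(Q[x, v] for v in decisions_i) for x in states}
        new = {}
        for x in states:
            for u in decisions_i:
                total = 0.0
                for w, pw in others(x).items():
                    cont = sum(p * V[y] for y, p in kernel(x, u, w).items())
                    total += pw * (cost_i(x, u, w) + beta * cont)
                new[x, u] = total
        if max(abs(new[key] - Q[key]) for key in Q) < tol:
            return new
        Q = new


def best_reply_decisions(Q, states, decisions_i, tol=1e-8):
    """Per-state minimizers of Q; a deterministic best reply picks one per state."""
    out = {}
    for x in states:
        m = min(Q[x, v] for v in decisions_i)
        out[x] = [u for u in decisions_i if Q[x, u] <= m + tol]
    return out
```

The set of deterministic best replies is the Cartesian product of the per-state lists returned by `best_reply_decisions`.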

Definition 2: We call a (possibly finite) sequence of deterministic
joint policies $\pi_0, \pi_1, \dots$ a strict best reply path if, for
each $k$, $\pi_k$ and $\pi_{k+1}$ differ in exactly one DM position, say
DM$^i$, and $\pi^i_{k+1}$ is a strict best reply with respect to $\pi_k$.

Definition 3: A discounted stochastic game is called weakly 
acyclic under strict best replies if there is a strict best reply 
path starting from each deterministic joint policy and ending 
at a deterministic equilibrium policy. 

Figure 2 shows the strict best reply graph of a game where
the nodes represent the deterministic joint policies and the
directed edges represent the single-DM strict best replies. Each
deterministic equilibrium policy is represented by a sink, i.e.,
a node with no outgoing edges, in such a graph. Note that the
game illustrated in Figure 2 is weakly acyclic under strict best

Fig. 2. The strict best reply graph of a stochastic game.

replies since there is a path from every node to a sink (w? or 
7r 10 ). Note also that a weakly acyclic game may have cycles in 
its strict best reply graph, for example, tti —> 714 —> 7 r 12 —» 7Tg 
in Figure 2.
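Given a strict best reply graph, weak acyclicity amounts to checking that every node can reach a sink. One way to sketch this check is a backward search from the sinks; the function name `is_weakly_acyclic` and the adjacency-dictionary representation are assumptions made for illustration, and the example graphs in the note below are hypothetical rather than the graph of Figure 2.

```python
from collections import deque


def is_weakly_acyclic(edges):
    """Check that every node of a directed graph has a path to a sink.

    edges maps each node to a list of its successors (strict best reply moves).
    """
    nodes = set(edges) | {v for succ in edges.values() for v in succ}
    preds = {n: [] for n in nodes}
    for n, succ in edges.items():
        for v in succ:
            preds[v].append(n)
    sinks = [n for n in nodes if not edges.get(n)]
    # Breadth-first search along reversed edges, starting from all sinks.
    reached = set(sinks)
    queue = deque(sinks)
    while queue:
        n = queue.popleft()
        for p in preds[n]:
            if p not in reached:
                reached.add(p)
                queue.append(p)
    return reached == nodes
```

A graph with the cycle a → b → a and an exit edge b → c passes the check because c is a sink, whereas the bare cycle a → b → a does not.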

Weakly acyclic games constitute a fairly large class of
games. In the case of single-stage games, all potential games
as well as dominance solvable games are examples of weakly
acyclic games; see [29]. We note that the concept of weak
acyclicity introduced in this paper is with respect to the
stationary Markov policies for stochastic games, and constitutes
a generalization of weak acyclicity introduced in ESI
for single-stage games. The primary examples of weakly
acyclic games in the case of stochastic games are the team
problems with finite state and control sets where DMs have
identical cost functions and discount factors. Clearly, many
other classes of stochastic games are weakly acyclic, e.g.,
appropriate multi-stage generalizations of potential games and
dominance solvable games restricted to the stationary Markov
policies are weakly acyclic for the same reason that the
single-stage versions of these games are weakly acyclic [29].


C. A Best Reply Process for Weakly Acyclic Games 

Consider a policy adjustment process in which only one DM 
updates its policy at each step by switching to one of its strict 
best replies. Such a process would terminate at an equilibrium 
policy if the game has no cycles in its strict best reply graph 
and the process continues until no DM has strict best replies. 
A weakly acyclic game may contain cycles in its strict best 
reply graph but there must be some edges leaving each cycle 
because otherwise there would not be a path from each node 
to a sink. Therefore, as long as each updating DM considers 
each of its strict best replies with positive probability, the 
adjustment process would terminate at an equilibrium policy 
in a weakly acyclic game with probability (w.p.) one. This 
adjustment process would require a criterion to determine the 
updating DM at each step and the DMs would have to a 
priori agree to this criterion. An equilibrium policy can be 
reached through a similar adjustment process without a pre-game
agreement on the selection of the updating DM, if all
DMs update their policies at each step but with some inertia.
Consider now the following policy adjustment process, which
is the best reply process with memory length of one and inertia
introduced in Sections 6.4-6.5 of [30].

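Best Reply Process with Inertia (for DM$^i$):

Set parameters
    $\lambda^i \in (0,1)$: inertia
Initialize $\pi_0^i \in \Pi^i$ (arbitrary)
Iterate $k \ge 0$
    If $\pi_k^i \in \Pi^i_{\pi_k^{-i}}$
        $\pi_{k+1}^i = \pi_k^i$
    Else
        $\pi_{k+1}^i = \pi_k^i$ w.p. $\lambda^i$
        $\pi_{k+1}^i = $ any $\pi^i \in \Pi^i_{\pi_k^{-i}}$ w.p. $(1 - \lambda^i)/|\Pi^i_{\pi_k^{-i}}|$
    End
End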


On the one hand, if the joint policy $\pi_k := (\pi_k^1, \dots, \pi_k^N)$ is
an equilibrium policy at any step $k$, then the policies will never
change in the subsequent steps. On the other hand, regardless
of what the joint policy $\pi_k := (\pi_k^1, \dots, \pi_k^N)$ is at any step $k$,
the joint policy $\pi_{k+L}$ in $L$ steps later will be an equilibrium
policy with positive probability $p_{\min} > 0$, where $L$ is the
maximum length of the shortest strict best reply path from
any policy to an equilibrium policy and $p_{\min}$ depends only
on the inertias $\lambda^1, \dots, \lambda^N$, and $L$. This readily implies that
the best reply process with inertia will reach an equilibrium
policy in a finite number of steps w.p. 1 [30], i.e.,

$$P[\pi_k = \pi^*, \text{ for some } \pi^* \in \Pi_{\mathrm{eq}} \text{ and all large } k < \infty] = 1.$$

We now note that each updating DM$^i$ at step $k$ needs to
compute its best replies $\Pi^i_{\pi_k^{-i}}$, which can be done by first
solving the fixed point equation (2) for $\pi^{-i} = \pi_k^{-i}$. DM$^i$
can solve (2), for example through value iterations, provided
that DM$^i$ knows the state transition probabilities $P$ and the
policies $\pi_k^{-i}$ of the other DMs to evaluate the expectations in
(2). In most realistic situations, DMs would not have access to
such information and therefore would not be able to compute
their best replies directly. In the next section, we introduce our
learning paradigm in which DMs would be able to learn their
near best replies with minimal information and adjust their
policies (approximately) along the strict best reply paths as in
the best reply process with inertia.
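The best reply process with inertia described in this subsection can be simulated once a best-reply oracle is available. The sketch below assumes a hypothetical callable `best_replies(i, joint)` that returns the list of DM i's best replies to the current joint policy; in the paper this set would come from solving (2), whereas here a toy coordination rule stands in for it.

```python
import random


def best_reply_process(best_replies, initial, inertia, steps, rng):
    """Simultaneous best reply updates with inertia for all DMs.

    best_replies(i, joint) is an assumed oracle returning a list of DM i's
    best replies to the joint policy; inertia[i] plays the role of lambda^i.
    """
    joint = list(initial)
    for _ in range(steps):
        nxt = []
        for i, pi in enumerate(joint):
            br = best_replies(i, tuple(joint))
            # Keep the policy if it is already a best reply, or with
            # probability inertia[i]; otherwise pick a best reply uniformly.
            if pi in br or rng.random() < inertia[i]:
                nxt.append(pi)
            else:
                nxt.append(rng.choice(br))
        joint = nxt
    return tuple(joint)


def match_other(i, joint):
    """Toy two-DM coordination rule: the best reply is to copy the other DM."""
    return [joint[1 - i]]
```

In the toy rule, a mismatched pair becomes matched in one step exactly when one of the two DMs switches, which happens with probability 1/2 for inertia 1/2, and a matched pair never changes afterwards.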


III. Q-Learning in Stochastic Dynamic Games 
A. Learning Paradigm for Stochastic Dynamic Games 

The learning setup involves specifying the information that
DMs have access to. We assume that each DM$^i$ knows its own
set $\mathbb{U}^i$ of decisions and its own discount factor $\beta^i$. In addition,
before choosing its decision $u_t^i$ at any time $t$, each DM$^i$ has
the knowledge of

• its own past decisions $u_0^i, \dots, u_{t-1}^i$, and

• past and current state realizations $x_0, \dots, x_t$, and

• its own past cost realizations
$c^i(x_0, u_0^i, u_0^{-i}), \dots, c^i(x_{t-1}, u_{t-1}^i, u_{t-1}^{-i})$.

Each DM$^i$ has access to no other information such as the state
transition probabilities or any information regarding the other
DMs (not even the existence of the other DMs). In effect,
the problem of decision making from the perspective of each
DM$^i$ appears to be a stationary Markov decision problem. It
is reasonable that each DM$^i$ with this view of its environment
would use the standard Q-learning algorithm [2] to learn its
optimal Q-factors and its optimal decisions. This would lead
to the following Q-learning dynamics for each DM$^i$:

$$Q_{t+1}^i(x, u^i) = Q_t^i(x, u^i), \quad \text{for all } (x, u^i) \ne (x_t, u_t^i)$$

$$Q_{t+1}^i(x_t, u_t^i) = Q_t^i(x_t, u_t^i) + \alpha_t^i \Big[c^i(x_t, u_t^i, u_t^{-i}) + \beta^i \min_{v^i \in \mathbb{U}^i} Q_t^i(x_{t+1}, v^i) - Q_t^i(x_t, u_t^i)\Big]$$

where $\alpha_t^i \in [0,1]$ denotes DM$^i$’s step size at time $t$.
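The update above touches a single entry of the Q-table per time step. A minimal sketch of one such step follows; the function name `q_update` and the dictionary keyed by (state, decision) pairs are illustrative assumptions.

```python
def q_update(Q, x, u, cost, x_next, decisions, beta, alpha):
    """Apply one standard Q-learning step in place.

    Q is a dict keyed by (state, decision); only the entry (x, u) changes,
    moving toward cost + beta * min_v Q[x_next, v] with step size alpha.
    """
    target = cost + beta * min(Q[x_next, v] for v in decisions)
    Q[x, u] += alpha * (target - Q[x, u])
    return Q
```

Starting from an all-zero table, observing a cost of 1 with step size 1/2 moves the visited entry to 1/2 and leaves every other entry unchanged.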

If only one DM, say DM$^i$, were to use Q-learning and
the other DMs used constant policies $\pi^{-i}$, then DM$^i$ would
asymptotically learn its corresponding optimal Q-factors, i.e.,

$$P[Q_t^i \to Q^i_{\pi^{-i}}] = 1$$

provided that all state-control pairs $(x, u^i)$ are visited infinitely
often and the step sizes are reduced at a proper rate. This
follows from the well-known convergence of Q-learning in
a stationary environment; see [31]. To exploit the learnt
Q-factors while maintaining exploration, the actual decisions are
often selected with very high probability as

$$u_t^i \in \operatorname{argmin}_{v^i \in \mathbb{U}^i} Q_t^i(x_t, v^i)$$

and with some small probability any decision in $\mathbb{U}^i$ is
experimented. One common way of achieving this for DM$^i$ is to
select any decision $u^i \in \mathbb{U}^i$ randomly according to (Boltzmann
action selection)

$$P[u_t^i = u^i \mid \mathcal{F}_t] = \frac{e^{-Q_t^i(x_t, u^i)/\tau}}{\sum_{v^i \in \mathbb{U}^i} e^{-Q_t^i(x_t, v^i)/\tau}}$$

where $\tau > 0$ is a small constant called the temperature
parameter, and $\mathcal{F}_t$ is the history of the random events realized
up to the point just before the selection of $(u_t^1, \dots, u_t^N)$.
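Boltzmann action selection for cost minimization can be sketched as follows. The function name `boltzmann_probs` and the input format (a dictionary from decisions to Q-values at the current state) are assumptions made for this example; subtracting the minimum Q-value before exponentiating is a numerical safeguard that leaves the probabilities unchanged.

```python
import math


def boltzmann_probs(q_row, tau):
    """Selection probabilities proportional to exp(-Q/tau) over one state's Q-values."""
    m = min(q_row.values())
    # Shifting by the minimum avoids overflow for small tau without
    # altering the normalized probabilities.
    weights = {u: math.exp(-(q - m) / tau) for u, q in q_row.items()}
    z = sum(weights.values())
    return {u: w / z for u, w in weights.items()}
```

As the temperature decreases, the probability mass concentrates on the decisions with the smallest Q-value, while equal Q-values always receive equal probability.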

However, when all DMs use Q-learning and select their decisions
as described above, the environment is non-stationary
for all DMs, and there is no reason to expect convergence
in that case. In fact, one can construct examples where DMs
using Q-learning are caught up in persistent oscillations; see
Section 4 in [32] for the non-convergence of Q-learning in
Shapley’s game. However, the convergence of Q-learning may
still be possible in team problems, coordination-type games,
or more generally in weakly acyclic games. It is instructive to
first consider the repeated games.

Here, there is no state dynamics (the set $\mathbb{X}$ of states is a
singleton) and the DMs have no look-ahead ($\beta^1 = \cdots = \beta^N =
0$). The only dynamics in this case is due to Q-learning, which
reduces to the averaging dynamics

$$Q_{t+1}^i(u^i) = Q_t^i(u^i), \quad \text{for all } u^i \ne u_t^i \qquad (3)$$

$$Q_{t+1}^i(u_t^i) = Q_t^i(u_t^i) + \alpha_t^i \big[c^i(u_t^i, u_t^{-i}) - Q_t^i(u_t^i)\big] \qquad (4)$$

where

$$P[u_t^i = u^i \mid \mathcal{F}_t] = \frac{e^{-Q_t^i(u^i)/\tau}}{\sum_{v^i \in \mathbb{U}^i} e^{-Q_t^i(v^i)/\tau}}. \qquad (5)$$

The long-term behavior of these averaging dynamics is analyzed
in [32] and strongly connected to the long-term behavior
of the well-known Stochastic Fictitious Play (SFP) dynamics
[33] in the case of two DMs; see Lemma 4.1 in [32]. In two-DM
SFP, each DM$^i$ tracks the empirical frequencies of the past
decisions of its opponent DM$^{-i}$ and chooses a nearly optimal
response (with some experimentation) based on the incorrect
assumption that DM$^{-i}$ will choose its decisions according to
the empirical frequencies of its past decisions

$$q_t^{-i}(u^{-i}) = \frac{1}{t} \sum_{k=0}^{t-1} I_{\{u_k^{-i} = u^{-i}\}}, \quad \text{for all } u^{-i}$$

where $I_{\{\cdot\}}$ is the indicator function and

e -M*(«*)/r 

E^e - ^)/- 

U~ % 

Using the connection between Q-learning dynamics (3)-(5)
and SFP dynamics, the convergence of Q-learning (3)-(5)
is established in zero-sum games as well as in partnership
games with two DMs; see Proposition 4.2 in [32]. It may
be possible to extend this convergence result to multi-DM
potential games [34], [35], but this is currently unresolved.
However, given the nonconvergence of FP (where DMs choose
exact optimal responses with no experimentation, i.e., $\tau \downarrow 0$)
in some coordination games [36], the prospect of establishing
the convergence of Q-learning even in all two-DM weakly
acyclic games does not seem promising.

It is possible to employ additional features such as the truncation
of the observation history or multi-time-scale learning
to obtain learning dynamics that are convergent in all repeated
weakly acyclic games; see our own previous work [37] and the
others [38], [30], [39], [40]. However, the question of learning
an equilibrium policy in stochastic games is an open question.
The only relevant reference considering the stochastic games is 
ED where each DM uses value learning coupled with policy 
search at a slower time-scale. The results in EH apply to 
all stochastic games and therefore they are necessarily quite 
weak. Loosely speaking, the main result in EH shows that 
the limit points of certain empirical measures (weighted with 
the step sizes) in the policy space constitute “generalized Nash 
equilibria”, which in particular does not imply convergence of 
learning to an equilibrium policy. In the next subsection, we 
propose a simple variation of Q-learning which converges to 
an equilibrium policy in all weakly acyclic stochastic games. 

B. Q-Learning in Stochastic Dynamic Games

The discussion in the previous subsection reveals that the
standard Q-learning (3)-(5) can lead to robust oscillations even
in repeated coordination games. The main obstacle to convergence
of Q-learning in games is due to the presence of multiple
active learners leading to a non-stationary environment for all
learners. To overcome this obstacle, we use some inspiration
from our previous work [37] on repeated games and modify
the Q-learning for stochastic games as follows. In our variation
of Q-learning, we allow DMs to use constant policies for
extended periods of time called exploration phases.

As illustrated in Figure 3, the $k$-th exploration phase runs
through times $t = t_k, \dots, t_{k+1} - 1$, where

$$t_{k+1} = t_k + T_k \quad (\text{with } t_0 = 0)$$

for some integer $T_k \in [1, \infty)$ denoting the length of the
$k$-th exploration phase. During the $k$-th exploration phase,
DMs use some constant policies $\pi_k^1, \dots, \pi_k^N$ as their baseline
policies with occasional experimentation. The essence of the
main idea is to create a stationary environment over each
exploration phase so that DMs can accurately learn their
optimal Q-factors corresponding to the constant policies used
during each exploration phase. Before arguing why this would
lead to an equilibrium policy in all weakly acyclic stochastic
games, let us introduce our variation of Q-learning more
precisely.

Fig. 3. An illustration of the $k$-th exploration phase.

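Algorithm 1 (for DM$^i$):

Set parameters
    $\mathbb{Q}^i$: some compact subset of the Euclidean space $\mathbb{R}^{|\mathbb{X} \times \mathbb{U}^i|}$,
        where $|\mathbb{X} \times \mathbb{U}^i|$ is the number of pairs $(x, u^i)$
    $\{T_k\}_{k \ge 0}$: sequence of integers in $[1, \infty)$
    $\rho^i \in (0,1)$: experimentation probability
    $\lambda^i \in (0,1)$: inertia
    $\delta^i \in (0, \infty)$: tolerance level for sub-optimality
    $\{\alpha_n^i\}_{n \ge 0}$: sequence of step sizes where
        $\alpha_n^i \in [0,1]$, $\sum_n \alpha_n^i = \infty$, $\sum_n (\alpha_n^i)^2 < \infty$
        (e.g., $\alpha_n^i = 1/n^r$ where $r \in (1/2, 1]$)
Initialize $\pi_0^i \in \Pi^i$ (arbitrary), $Q_0^i \in \mathbb{Q}^i$ (arbitrary)
Receive $x_0$
Iterate $k \ge 0$ ($k$-th exploration phase)
    Iterate $t = t_k, \dots, t_{k+1} - 1$
        $u_t^i = \pi_k^i(x_t)$ w.p. $1 - \rho^i$
        $u_t^i = $ any $u^i \in \mathbb{U}^i$ w.p. $\rho^i / |\mathbb{U}^i|$
        Receive $c^i(x_t, u_t^i, u_t^{-i})$
        Receive $x_{t+1}$ (selected according to $P[\,\cdot \mid x_t, u_t^i, u_t^{-i}]$)
        $n_t^i$ = the number of visits to $(x_t, u_t^i)$ in the $k$-th
            exploration phase up to $t$
        $Q_{t+1}^i(x_t, u_t^i) = (1 - \alpha_{n_t^i}^i) Q_t^i(x_t, u_t^i)
            + \alpha_{n_t^i}^i \big[c^i(x_t, u_t^i, u_t^{-i}) + \beta^i \min_{v^i} Q_t^i(x_{t+1}, v^i)\big]$
        $Q_{t+1}^i(x, u^i) = Q_t^i(x, u^i)$, for all $(x, u^i) \ne (x_t, u_t^i)$
    End
    $\Pi_{k+1}^i = \{\hat{\pi}^i \in \Pi^i : Q_{t_{k+1}}^i(x, \hat{\pi}^i(x))
        \le \min_{v^i} Q_{t_{k+1}}^i(x, v^i) + \delta^i, \text{ for all } x\}$
    If $\pi_k^i \in \Pi_{k+1}^i$
        $\pi_{k+1}^i = \pi_k^i$
    Else
        $\pi_{k+1}^i = \pi_k^i$ w.p. $\lambda^i$
        $\pi_{k+1}^i = $ any $\pi^i \in \Pi_{k+1}^i$ w.p. $(1 - \lambda^i)/|\Pi_{k+1}^i|$
    End
    Reset $Q_{t_{k+1}}^i$ to any $Q^i \in \mathbb{Q}^i$ (e.g., project $Q_{t_{k+1}}^i$ onto $\mathbb{Q}^i$)
End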
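To make the structure of Algorithm 1 concrete, the sketch below simulates all DMs running it on a small game. It is an illustration under several assumptions that are not in the paper: the function name `run_algorithm1`, a common phase length and common parameters for all DMs, step sizes $1/n$, resetting the Q-factors to zero after each phase, and the single-state coordination game used as a test bed.

```python
import random


def run_algorithm1(states, decisions, cost, kernel, betas, phases, phase_len,
                   rho, lam, delta, rng):
    """Simulate every DM running the exploration-phase Q-learning sketch."""
    n = len(decisions)
    # Baseline deterministic policies, stored as tuples indexed by state position.
    policies = [tuple(rng.choice(decisions[i]) for _ in states) for i in range(n)]
    Q = [{(x, u): 0.0 for x in states for u in decisions[i]} for i in range(n)]
    x = rng.choice(states)
    for _ in range(phases):
        visits = [dict.fromkeys(Q[i], 0) for i in range(n)]
        for _ in range(phase_len):
            # Baseline decision w.p. 1 - rho, uniform experimentation w.p. rho.
            u = tuple(
                rng.choice(decisions[i]) if rng.random() < rho
                else policies[i][states.index(x)]
                for i in range(n))
            dist = kernel(x, u)
            x_next = rng.choices(list(dist), weights=list(dist.values()))[0]
            for i in range(n):
                # Each DM uses only the state, its own decision, and its own cost.
                visits[i][x, u[i]] += 1
                alpha = 1.0 / visits[i][x, u[i]]
                target = cost(i, x, u) + betas[i] * min(
                    Q[i][x_next, v] for v in decisions[i])
                Q[i][x, u[i]] = (1 - alpha) * Q[i][x, u[i]] + alpha * target
            x = x_next
        for i in range(n):
            # Per-state decisions within delta of the learnt minimum.
            near_best = [
                [v for v in decisions[i]
                 if Q[i][s, v] <= min(Q[i][s, w] for w in decisions[i]) + delta]
                for s in states]
            ok = all(policies[i][k] in near_best[k] for k in range(len(states)))
            if not ok and rng.random() >= lam:
                # Uniform draw from the product of the per-state near-best sets.
                policies[i] = tuple(rng.choice(c) for c in near_best)
            Q[i] = {key: 0.0 for key in Q[i]}  # reset (assumes 0 is admissible)
    return policies


def coordination_cost(i, x, u):
    """Hypothetical team cost: zero when the two DMs agree, one otherwise."""
    return 0.0 if u[0] == u[1] else 1.0


def single_state_kernel(x, u):
    """Hypothetical kernel for a game with a single state."""
    return {0: 1.0}
```

In the coordination test bed, the two equilibria are the joint policies in which both DMs choose the same decision, and with reasonably long phases the returned baseline policies are expected to agree.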


Algorithm 1 mimics the best reply process with inertia in
§II-C arbitrarily closely with arbitrarily high probability under
certain conditions. The key difference here is that each DM
using Algorithm 1 approximately learns its optimal Q-factors
during each exploration phase with limited observations.
Accordingly, each DM updates its (baseline) policy to one of
its near best replies with inertia based on its learnt Q-factors.
Hence, Algorithm 1 can be regarded as an approximation to
the best reply process with inertia in §II-C; see ED where best
replies are obtained based on rewards that must be estimated
using noisy observations.

Assumption 1: For all $(x', x)$, there exist a finite integer
$H \ge 0$ and joint decisions $\tilde{u}_0, \dots, \tilde{u}_H$ such that

$$P[x_{H+1} = x' \mid (x_0, u_0, \dots, u_H) = (x, \tilde{u}_0, \dots, \tilde{u}_H)] > 0.$$

Assumption 1 ensures that the step sizes satisfy the well-known
conditions of the stochastic approximation theory [31]
during each exploration phase.

Assumption 2: For all $i$, $0 < \delta^i < \bar{\delta}$ and $0 < \rho^i < \bar{\rho}$, where
$\bar{\delta}$ and $\bar{\rho}$ (which depend only on the parameters of the game at
hand) are defined in Appendix B.

Assumption 2 requires that the tolerance levels for sub-optimality
used in the computation of near best replies as well
as the experimentation probabilities be nonzero but sufficiently
small.

Theorem 1: Consider a discounted stochastic game that
is weakly acyclic under strict best replies. Suppose that
each DM$^i$ updates its policies by Algorithm 1. Let Assumptions
1 and 2 hold.

(i) For any $\epsilon > 0$, there exist $\tilde{T} < \infty$, $\tilde{k} < \infty$ such that if
$\min_\ell T_\ell \ge \tilde{T}$, then

$$P[\pi_k \in \Pi_{\mathrm{eq}}] \ge 1 - \epsilon, \quad \text{for all } k \ge \tilde{k}.$$

(ii) If $T_k \to \infty$, then

$$P[\pi_k \in \Pi_{\mathrm{eq}}] \to 1.$$

(iii) There exist finite integers $\{\tilde{T}_k\}_{k \ge 0}$ such that if $T_k \ge
\tilde{T}_k$, for all $k$, then

$$P[\pi_k \to \pi^*, \text{ for some } \pi^* \in \Pi_{\mathrm{eq}}] = 1.$$

Proof: See Appendix B. ■

Let us discuss the main idea behind this result. Since all DMs use constant policies throughout any particular exploration phase, each DM indeed faces a stationary Markov decision problem in each exploration phase. Therefore, if the length of each exploration phase is long enough and the experimentation probabilities $\rho^1, \dots, \rho^N$ are small enough (but non-zero), each DM$^i$ can learn its corresponding optimal Q-factors in each exploration phase with arbitrary accuracy with arbitrarily high probability. This allows each DM$^i$ to accurately compute its near best replies to the other DMs' policies $\pi_k^{-i}$ at the end of the $k$-th exploration phase. Intuitively, allowing each DM$^i$ to update its policy $\pi_k^i$ to its near best replies (to $\pi_k^{-i}$) at the end of the $k$-th exploration phase with some inertia $\lambda^i \in (0,1)$ results in a policy adjustment process that approximates the best reply process with inertia in Section II-C.

Remark 1: One may also wish to find explicit lower bounds on $T_k$ to achieve almost sure convergence based on the convergence rates of the standard Q-learning with a single DM; we refer the reader to [42] for bounds on the convergence rates for standard Q-learning.
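
The end-of-phase policy update of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary layout of the Q-factors, and the representation of a policy as a tuple of actions (one per state) are assumptions made here, and the sketch assumes that a DM whose baseline policy is already a near best reply keeps it, as in the best reply process with inertia.

```python
import itertools
import random

def near_best_replies(Q, states, actions, delta):
    """All deterministic policies (tuples indexed like `states`) whose action
    in every state is within `delta` of the smallest learnt Q-factor (costs)."""
    per_state = []
    for x in states:
        q_min = min(Q[(x, u)] for u in actions)
        per_state.append([u for u in actions if Q[(x, u)] <= q_min + delta])
    return [tuple(p) for p in itertools.product(*per_state)]

def update_policy(policy, Q, states, actions, delta, lam, rng=random):
    """Keep the baseline policy if it is already a near best reply; otherwise
    keep it w.p. `lam` (inertia) or switch to a uniformly chosen near best reply."""
    candidates = near_best_replies(Q, states, actions, delta)
    if policy in candidates or rng.random() < lam:
        return policy
    return rng.choice(candidates)
```

With a tolerance `delta` smaller than the gaps between the true optimal Q-factors, accurate learnt Q-factors make `near_best_replies` return exactly the best replies, which is the mechanism the discussion above relies on.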

IV. Generalizations 

A. Learning in Weakly Acyclic Games under Strict Better 
Replies 

We present another Q-learning algorithm with provable convergence to equilibrium in discounted stochastic games that are weakly acyclic under strict better replies. For this, we first introduce the notion of weak acyclicity under strict better replies. Given any $\pi = (\pi^i, \pi^{-i}) \in \Delta$, let $\Pi^i_\pi$ denote DM$^i$'s set of (deterministic) better replies with respect to $\pi$, i.e.,

$$\Pi^i_\pi := \{\hat\pi^i \in \Pi^i : J^i_x(\hat\pi^i, \pi^{-i}) \leq J^i_x(\pi^i, \pi^{-i}), \text{ for all } x\}.$$

Any better reply $\hat\pi^i \in \Pi^i_\pi$ of DM$^i$ is called a strict better reply (with respect to $\pi$) if

$$J^i_x(\hat\pi^i, \pi^{-i}) < J^i_x(\pi^i, \pi^{-i}), \text{ for some } x.$$

Definition 4: We call a (possibly finite) sequence of deterministic joint policies $\pi_0, \pi_1, \dots$ a strict better reply path if, for each $k$, $\pi_k$ and $\pi_{k+1}$ differ in exactly one DM position, say DM$^i$, and $\pi^i_{k+1}$ is a strict better reply with respect to $\pi_k$.

Definition 5: A discounted stochastic game is called weakly acyclic under strict better replies if there is a strict better reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.

Since every strict best reply path is also a strict better reply path, the set of games weakly acyclic under strict better replies contains (in fact, strictly) the set of games weakly acyclic under strict best replies.

It is straightforward to introduce a policy adjustment process analogous to the one in Section II-C where, at each step, each DM$^i$ switches to one of its strict better replies with some inertia; see Sections 6.4-6.5 in [30]. Such a process would clearly converge to an equilibrium in games that are weakly acyclic under strict better replies. We next introduce a learning algorithm which allows each DM to learn the Q-factors corresponding to two policies, a baseline policy and a randomly selected experimental policy, during each exploration phase. If the learnt Q-factors indicate that the experimental policy is better than the baseline policy within a certain tolerance level, then the baseline policy is updated to the experimental policy with some inertia at the end of each exploration phase. This learning algorithm enables DMs to adjust their policies with much less information (as in Section III-A), and follow (approximately) along the strict better reply paths that the adjustment process follows.

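Algorithm 2 (for DM$^i$):

Set parameters as in Algorithm 1

Initialize $\pi^i_0, \hat\pi^i_0 \in \Pi^i$ (arbitrary except $\pi^i_0 \neq \hat\pi^i_0$), $Q^i_0, \hat Q^i_0 \in \mathbb{Q}^i$ (arbitrary)

Receive $x_0$

Iterate $k \geq 0$

    ($k$-th exploration phase)

    Iterate $t = t_k, \dots, t_{k+1} - 1$

        $$u^i_t = \begin{cases} \pi^i_k(x_t), & \text{w.p. } 1 - \rho^i \\ \text{any } u^i \in U^i, & \text{w.p. } \rho^i/|U^i| \end{cases}$$

        Receive $c^i(x_t, u^i_t, u^{-i}_t)$

        Receive $x_{t+1}$ (selected according to $P[\,\cdot \mid x_t, u^i_t, u^{-i}_t]$)

        $n^i_t$ = the number of visits to $(x_t, u^i_t)$ in the $k$-th exploration phase up to $t$

        $Q^i_{t+1}(x_t, u^i_t) = (1 - \alpha^i_{n^i_t}) Q^i_t(x_t, u^i_t) + \alpha^i_{n^i_t} [c^i(x_t, u^i_t, u^{-i}_t) + \beta^i Q^i_t(x_{t+1}, \pi^i_k(x_{t+1}))]$

        $\hat Q^i_{t+1}(x_t, u^i_t) = (1 - \alpha^i_{n^i_t}) \hat Q^i_t(x_t, u^i_t) + \alpha^i_{n^i_t} [c^i(x_t, u^i_t, u^{-i}_t) + \beta^i \hat Q^i_t(x_{t+1}, \hat\pi^i_k(x_{t+1}))]$

        $Q^i_{t+1}(x, u^i) = Q^i_t(x, u^i)$, for all $(x, u^i) \neq (x_t, u^i_t)$

        $\hat Q^i_{t+1}(x, u^i) = \hat Q^i_t(x, u^i)$, for all $(x, u^i) \neq (x_t, u^i_t)$

    End

    If ($\hat Q^i_{t_{k+1}}(x, \hat\pi^i_k(x)) \leq Q^i_{t_{k+1}}(x, \pi^i_k(x)) + \delta^i$, for all $x$) and ($\hat Q^i_{t_{k+1}}(x, \hat\pi^i_k(x)) \leq Q^i_{t_{k+1}}(x, \pi^i_k(x)) - \delta^i$, for some $x$)

        $$\pi^i_{k+1} = \begin{cases} \pi^i_k, & \text{w.p. } \lambda^i \\ \hat\pi^i_k, & \text{w.p. } 1 - \lambda^i \end{cases}$$

    Else

        $\pi^i_{k+1} = \pi^i_k$

    End

    $\hat\pi^i_{k+1}$ = any policy $\pi^i \in \Pi^i \setminus \{\pi^i_{k+1}\}$ with equal probability

    Reset $Q^i_{t_{k+1}}, \hat Q^i_{t_{k+1}}$ to any $Q^i, \hat Q^i \in \mathbb{Q}^i$

End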
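
The comparison step at the end of each exploration phase in Algorithm 2 can be sketched as follows. The function names and the dictionary-based representation of policies (state to action) and Q-factors ((state, action) to value) are illustrative assumptions, not part of the original algorithm statement.

```python
import random

def accepts_experiment(Q_base, Q_exp, pi_base, pi_exp, states, delta):
    """Acceptance test at the end of an exploration phase (costs are minimized):
    the experimental policy must not be worse than the baseline by more than
    `delta` in any state, and must be better by at least `delta` in some state."""
    gaps = [Q_exp[(x, pi_exp[x])] - Q_base[(x, pi_base[x])] for x in states]
    return all(g <= delta for g in gaps) and any(g <= -delta for g in gaps)

def next_baseline(Q_base, Q_exp, pi_base, pi_exp, states, delta, lam, rng=random):
    """Switch to the experimental policy w.p. 1 - lam if it passes the test;
    otherwise (or w.p. lam) keep the baseline policy."""
    if accepts_experiment(Q_base, Q_exp, pi_base, pi_exp, states, delta):
        return pi_base if rng.random() < lam else pi_exp
    return pi_base
```

Only the Q-factors evaluated along each policy's own actions enter the test, which is why two Q-learning recursions (one per policy) suffice.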

Since any policy except the baseline policy can be chosen as an experimental policy (with equal probability), each DM can switch to any of its strict better replies with positive probability. In contrast, each DM using Algorithm 1 can only switch to one of its strict best replies. As a result, each DM using Algorithm 2 can escape a strict best reply cycle by switching to a strict better reply (if one exists), whereas a DM using Algorithm 1 cannot. This flexibility comes at the cost of running two Q-learning recursions, one for the baseline policy and the other for the experimental policy, instead of one. However, this flexibility also leads to convergent behavior in a strictly larger set of games. We cite [43] as a reference to an earlier use of the idea of comparing two strategies and selecting one according to the Boltzmann distribution.

The counterpart of Theorem 1 can be obtained for Algorithm 2 in games that are weakly acyclic under strict better replies.

Assumption 3: For all $i$, $0 < \delta^i < \bar\delta$ and $0 < \rho^i < \bar\rho$, where $\bar\delta$ and $\bar\rho$ (which depend only on the parameters of the game at hand) are defined in Appendix C.

Theorem 2: Consider a discounted stochastic game that is weakly acyclic under strict better replies. Suppose that each DM$^i$ updates its policies by Algorithm 2. Let Assumptions 1 and 3 hold.

(i) For any $\epsilon > 0$, there exist $\tilde T < \infty$, $\tilde k < \infty$ such that if $\min_\ell T_\ell \geq \tilde T$, then

$$P[\pi_k \in \Pi_{eq}] \geq 1 - \epsilon, \text{ for all } k \geq \tilde k.$$

(ii) If $T_k \to \infty$, then

$$P[\pi_k \in \Pi_{eq}] \to 1.$$

(iii) There exist finite integers $\{\tilde T_k\}_{k \geq 0}$ such that if $T_k \geq \tilde T_k$, for all $k$, then

$$P[\pi_k \to \pi^*, \text{ for some } \pi^* \in \Pi_{eq}] = 1.$$

Proof: See Appendix C. ■

B. Learning in Weakly Acyclic Games under multi-DM Strict 
Best or Better Replies 

The notion of weak acyclicity can be generalized by allowing multiple DMs to simultaneously update their policies in a strict best or better reply path.

Definition 6: We call a (possibly finite) sequence of deterministic joint policies $\pi_0, \pi_1, \dots$ a multi-DM strict best (better) reply path if, for each $k$, $\pi_k$ and $\pi_{k+1}$ differ for at least one DM and, for each deviating DM$^i$, $\pi^i_{k+1}$ is a strict best (better) reply with respect to $\pi_k$.

Definition 7: A discounted stochastic game is called weakly acyclic under multi-DM strict best (better) replies if there is a multi-DM strict best (better) reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.

This generalization leads to a strictly larger set of games that are weakly acyclic. To see this, consider a single-stage game characterized by the cost matrices in Fig. 4 where DM$^1$ chooses a row, DM$^2$ chooses a column, and DM$^3$ chooses a matrix, simultaneously. Assume $a > 0$.

Matrix 1 (DM$^3$ chooses 1):

         1             2             3
1    -a,  0,  0     0,  a,  0     0, -a, -a
2     a,  0,  0    -a, -a,  0     a,  0,  0
3     0, -a, -a     0,  a,  0    -a,  0, -a

Matrix 2 (DM$^3$ chooses 2):

         1             2             3
1     0, -a, -a     0,  0,  0     0,  0,  0
2     a,  0,  0    -a,  0, -a    -a, -a, -a
3    -a, -a,  0     0,  0,  0     0,  0,  0

Fig. 4. Cost matrices of a single-stage game with three DMs. Each entry lists the costs of DM$^1$, DM$^2$, DM$^3$, in that order.

There is no strict best (or better) reply path to an equilibrium from the joint decisions (1,1,1), (1,3,1), (3,3,1), (3,1,1), (1,1,2), (3,1,2), if only a single DM can update its decision at a time. Therefore, this game is not weakly acyclic under strict best (or better) replies in the sense of Definition 3 (or Definition 5). However, if multiple DMs are allowed to switch to their strict best (or better) replies simultaneously, then it becomes possible to reach the equilibrium (2,3,2) from any joint decision. For example, if DM$^2$ and DM$^3$ switch to their strict best (or better) replies simultaneously from the joint decision (1,1,1), then the resulting joint decision would be (1,3,2). This would subsequently lead to the equilibrium (2,3,2) if DM$^1$ switches to its strict best (or better) reply from (1,3,2).
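
The claims about this example can be checked mechanically. The sketch below encodes the two cost matrices with $a = 1$, reading the entry in row 2, column 3 of the second matrix as $(-a, -a, -a)$; that reading is an assumption chosen to be consistent with the stated equilibrium (2,3,2). All function names are hypothetical.

```python
a = 1.0  # any a > 0 yields the same comparisons
# COST[u3][u1][u2] = (cost of DM1, DM2, DM3); decisions are 1-based as in Fig. 4.
COST = {
    1: {1: {1: (-a, 0, 0), 2: (0, a, 0), 3: (0, -a, -a)},
        2: {1: (a, 0, 0), 2: (-a, -a, 0), 3: (a, 0, 0)},
        3: {1: (0, -a, -a), 2: (0, a, 0), 3: (-a, 0, -a)}},
    2: {1: {1: (0, -a, -a), 2: (0, 0, 0), 3: (0, 0, 0)},
        2: {1: (a, 0, 0), 2: (-a, 0, -a), 3: (-a, -a, -a)},
        3: {1: (-a, -a, 0), 2: (0, 0, 0), 3: (0, 0, 0)}},
}
CHOICES = ((1, 2, 3), (1, 2, 3), (1, 2))  # decision sets of DM1, DM2, DM3

def cost(u, i):
    """Cost of DM i (0-based) at the joint decision u = (u1, u2, u3)."""
    return COST[u[2]][u[0]][u[1]][i]

def strict_better_moves(u):
    """Joint decisions reachable when exactly one DM strictly lowers its cost."""
    moves = []
    for i in range(3):
        for v in CHOICES[i]:
            w = u[:i] + (v,) + u[i + 1:]
            if cost(w, i) < cost(u, i):
                moves.append(w)
    return moves

def is_equilibrium(u):
    return not strict_better_moves(u)

def reachable(start):
    """All joint decisions reachable from `start` by single-DM strict better replies."""
    seen, stack = {start}, [start]
    while stack:
        for w in strict_better_moves(stack.pop()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def best_reply(u, i):
    """A cost-minimizing decision of DM i against the others' decisions in u."""
    return min(CHOICES[i], key=lambda v: cost(u[:i] + (v,) + u[i + 1:], i))
```

Calling `reachable((1, 1, 1))` returns the six joint decisions listed above, none of which is an equilibrium, whereas letting DM$^2$ and DM$^3$ apply `best_reply` simultaneously at (1,1,1) and then DM$^1$ at the resulting decision leads to (2,3,2).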

All learning algorithms introduced in the paper allow multiple DMs to simultaneously update their policies with positive probability. In view of this, it is straightforward to see that our main convergence results, Theorem 1 (Theorem 2), hold in games that are weakly acyclic under multi-DM strict best (better) replies.


V. A Simulation Study: Prisoner's Dilemma with a State

We consider a discounted stochastic game with two DMs where $X = U^1 = U^2 = \{1, 2\}$. Each DM$^i$'s utility (to be maximized) at each time $t \geq 0$ depends only on the joint decisions $(u^1_t, u^2_t)$ of both DMs as

                  DM$^{-i}$:
                  1      2
DM$^i$:    1      c      a
           2      b      0

Fig. 5. DM$^i$'s single-stage utility.

We assume $b > c > 0 > a$. The state evolves as

$$P[x_{t+1} = 1 \mid (u^1_t, u^2_t) = (1,1)] = 1 - \gamma$$
$$P[x_{t+1} = 2 \mid (u^1_t, u^2_t) \neq (1,1)] = 1 - \gamma$$

where $\gamma \in (0,1)$ and $P[x_0 = 1] = 1/2$.

The single-stage game corresponds to the well-known prisoner's dilemma where the $i$-th prisoner (DM$^i$) cooperates (defects) at time $t$ by choosing $u^i_t = 1$ ($u^i_t = 2$). The single-stage game has a unique equilibrium $(u^1_t, u^2_t) = (2,2)$, i.e., both DMs defect, leading to rewards (0,0). The dilemma is that both DMs would do strictly better if they both cooperated, i.e., $(u^1_t, u^2_t) = (1,1)$ (not an equilibrium).

In the multi-stage game, the state $x_t$ indicates, w.p. $1 - \gamma$, whether or not both DMs cooperated in the previous stage. It turns out that cooperation can be obtained as an equilibrium of the multi-stage game if the DMs are patient, i.e., the discount factors are sufficiently high, and the error probability $\gamma$ is sufficiently small. Note that each DM$^i$ has four different policies of the form $\pi^i : X \to U^i$. For large enough $\beta^1$, $\beta^2$, and small enough $\gamma$, the multi-stage game has two (Markov perfect) equilibria. In one equilibrium, called the cooperation equilibrium, each DM cooperates if $x = 1$ and defects otherwise. In the other equilibrium, called the defection equilibrium, both DMs always defect. Furthermore, from any joint policy in $\Pi^1 \times \Pi^2$, there is a strict best reply path to one of these two equilibria, which implies that the multi-stage game is weakly acyclic under strict best replies.
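
The two equilibria can be verified numerically by solving the Markov decision problem that one DM faces when the other DM's policy is fixed. The sketch below uses $b = 2$, $c = 1$, $a = -1$ together with the illustrative values $\gamma = 0.1$ and $\beta^1 = \beta^2 = 0.9$; these last two values are assumptions chosen only to represent "small enough" and "large enough", and the function names are hypothetical. Utilities are maximized here, as in Fig. 5.

```python
B, C, A = 2.0, 1.0, -1.0      # b > c > 0 > a
GAMMA, BETA = 0.1, 0.9        # assumed values: small error probability, patient DMs
STATES = ACTIONS = (1, 2)

def utility(u_own, u_other):
    """Single-stage utility of a DM (1 = cooperate, 2 = defect)."""
    return {(1, 1): C, (1, 2): A, (2, 1): B, (2, 2): 0.0}[(u_own, u_other)]

def p_next(x_next, u1, u2):
    """The next state signals joint cooperation correctly w.p. 1 - GAMMA."""
    signal = 1 if (u1, u2) == (1, 1) else 2
    return 1 - GAMMA if x_next == signal else GAMMA

def optimal_q(pi_other, sweeps=2000):
    """Value iteration on the MDP one DM faces when the other DM plays pi_other."""
    Q = {(x, u): 0.0 for x in STATES for u in ACTIONS}
    for _ in range(sweeps):
        V = {x: max(Q[(x, u)] for u in ACTIONS) for x in STATES}
        Q = {(x, u): utility(u, pi_other[x])
                     + BETA * sum(p_next(y, u, pi_other[x]) * V[y] for y in STATES)
             for x in STATES for u in ACTIONS}
    return Q

def is_strict_best_reply(pi_own, pi_other):
    """True if pi_own is the unique maximizer of the optimal Q-factors in every state."""
    Q = optimal_q(pi_other)
    return all(Q[(x, pi_own[x])] > Q[(x, u)] + 1e-9
               for x in STATES for u in ACTIONS if u != pi_own[x])

cooperate = {1: 1, 2: 2}   # cooperate if x = 1, defect otherwise
defect = {1: 2, 2: 2}      # always defect
```

Since the game is symmetric, checking `is_strict_best_reply(cooperate, cooperate)` and `is_strict_best_reply(defect, defect)` is enough to confirm that both joint policies are equilibria under the assumed parameter values.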

We set $b = 2$, $c = 1$, $a = -1$, $\gamma = 0.3$. We simulate Algorithm 1 with the following parameter values: $\rho^i = 0.1$, $\lambda^i = 0.5$, $\delta^i = 0$, $\alpha^i_k = 1/k^{0.51}$, for all $i, k$. We keep the lengths of the exploration phases constant, i.e., $T_k = T$, for all $k$. We consider different values for $T$ since the lengths of the exploration phases appear to be most critical for the behavior of the learning process. For each value of $T$, we run Algorithm 1 and the best reply process with inertia (in Section II-C) in parallel, with 1000 policy updates starting from each of the 16 initial joint policies in $\Pi$. We initialize all the learnt Q-factors at 0 for each simulation run; however, we do not reset the learnt Q-factors at the end of any exploration phase during any simulation run. We let $\pi_k$ and $\tilde\pi_k$ denote the policies generated by Algorithm 1 and the best reply process with inertia in Section II-C, respectively. For each value of $T$, Table I shows the fraction of times at which $\pi_k$ visits an equilibrium and the fraction of times at which $\pi_k$ agrees with $\tilde\pi_k$ during the 1000 policy updates (averaged uniformly over all 16 initial policies in $\Pi$).

The results in Table I reveal that, as $T$ increases, $\pi_k$ visits an equilibrium and agrees with $\tilde\pi_k$ more often. This is consistent with Theorem 1 since DMs are expected to learn their Q-factors more accurately with higher probability for larger values of $T$. When $T$ is sufficiently large, the policies $\pi_k$ are at equilibrium and agree with $\tilde\pi_k$ nearly all of the time regardless of the initial policy. In a typical simulation run (with a large enough $T$), the policies $\pi_k$ and $\tilde\pi_k$ transition to an equilibrium in a few steps and stay at equilibrium thereafter.


T        $\frac{1}{1001}\sum_{k=0}^{1000} 1\{\pi_k \in \Pi_{eq}\}$     $\frac{1}{1001}\sum_{k=0}^{1000} 1\{\pi_k = \tilde\pi_k\}$
         (averaged over $\pi_0 \in \Pi$)          (averaged over $\pi_0 \in \Pi$)
10       0.2581                                   0.1254
25       0.5274                                   0.3410
50       0.7835                                   0.6170
100      0.9282                                   0.6301
1000     0.9935                                   0.6879
10000    0.9978                                   0.7733
50000    0.9976                                   0.9705

TABLE I
The fraction of times at which $\pi_k$ visits an equilibrium and the fraction of times at which $\pi_k$ agrees with $\tilde\pi_k$.
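
A simulation in the spirit of the one reported above can be sketched as follows. This is not the authors' code: it is a utility-maximizing variant of Algorithm 1 with $\delta^i = 0$, the discount factor `BETA = 0.9` is an assumed value, and the function name `run` and the data layout are hypothetical. As in the study, the learnt Q-factors are not reset between exploration phases, while the visit counts that drive the step sizes restart in every phase.

```python
import random

B, C, A, GAMMA = 2.0, 1.0, -1.0, 0.3    # b, c, a, gamma as in the simulation study
BETA = 0.9                              # assumed discount factor
RHO, LAM = 0.1, 0.5                     # experimentation probability and inertia
STATES = ACTIONS = (1, 2)
UTIL = {(1, 1): C, (1, 2): A, (2, 1): B, (2, 2): 0.0}  # (own, other) -> utility

def run(T, phases, pi0, rng):
    """Two DMs learn Q-factors for T steps per phase, then update their baseline
    policies with inertia. Returns the joint baseline policy of every phase."""
    pi = [dict(zip(STATES, p)) for p in pi0]
    Q = [{(x, u): 0.0 for x in STATES for u in ACTIONS} for _ in range(2)]
    x = rng.choice(STATES)
    history = []
    for _ in range(phases):
        history.append(tuple(tuple(p[s] for s in STATES) for p in pi))
        visits = [dict.fromkeys(Q[0], 0) for _ in range(2)]
        for _ in range(T):
            # Each DM follows its baseline policy, experimenting w.p. RHO.
            u = [rng.choice(ACTIONS) if rng.random() < RHO else pi[i][x] for i in range(2)]
            signal = 1 if u == [1, 1] else 2
            x_next = signal if rng.random() < 1 - GAMMA else 3 - signal
            for i in range(2):
                visits[i][(x, u[i])] += 1
                step = 1.0 / visits[i][(x, u[i])] ** 0.51
                target = UTIL[(u[i], u[1 - i])] + BETA * max(Q[i][(x_next, v)] for v in ACTIONS)
                Q[i][(x, u[i])] += step * (target - Q[i][(x, u[i])])
            x = x_next
        for i in range(2):
            # Maximizing actions per state under the learnt Q-factors (delta = 0).
            best = [[u for u in ACTIONS if Q[i][(s, u)] == max(Q[i][(s, v)] for v in ACTIONS)]
                    for s in STATES]
            if any(pi[i][s] not in b for s, b in zip(STATES, best)) and rng.random() >= LAM:
                pi[i] = {s: rng.choice(b) for s, b in zip(STATES, best)}
    return history
```

For example, `run(1000, 200, ((1, 1), (1, 1)), random.Random(0))` returns 200 joint policies, from which the fraction of phases spent at a given joint policy can be counted for different values of `T`.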


VI. Concluding Remarks 

In this paper, we develop decentralized Q-learning algorithms and present their convergence properties for stochastic games under weak acyclicity. This is the first paper, to our knowledge, that presents learning algorithms with convergence to equilibria in large classes of stochastic games. The decision makers observe only their own decisions and cost realizations, and the state transitions; they need not even know the presence of the other decision makers.

Our approach has a two-time scale flavor; however, unlike the existing work on multi-time-scale learning, it does not depend on the stochastic approximation theory. Note that the existing work on multi-time-scale learning, e.g., [?], [?], [?], [?], requires the stability analysis of some ordinary differential equations (ODE) describing the mean behavior of the learning algorithms. Aside from the difficulty of choosing the step sizes running at multiple time scales, the existing work involves nonlinear ODEs whose analysis does not seem to be within reach even for dynamic team problems. In contrast, our approach leads to a considerably simpler analysis for all weakly acyclic stochastic games.

Appendix A

A Uniform Convergence Result for the Standard Q-Learning Algorithm with a Single DM

Convergence of the standard Q-learning algorithm with a single DM is well known [?]. However, to prove the results of this paper, we need the sample paths generated by the standard Q-learning algorithm to be well behaved with respect to the initial conditions. Let us now consider a single-DM version of the setup introduced in Section II where the DM index $i$ (in the superscript) is dropped (only in Appendix A) and $c(x,u)$ representing the one-stage cost for applying control $u$ at $x$ is an exogenous random variable with finite variance. Let us assume that a single DM using a stationary random policy $\pi \in \Delta$ updates its Q-factors as: for $t \geq 0$,

$$Q_{t+1}(x,u) = Q_t(x,u), \text{ for all } (x,u) \neq (x_t,u_t) \qquad (6)$$

$$Q_{t+1}(x_t,u_t) = Q_t(x_t,u_t) + \alpha_{n_t}\Big(c(x_t,u_t) + \beta \min_v Q_t(x_{t+1},v) - Q_t(x_t,u_t)\Big) \qquad (7)$$

where the initial condition $Q_0$ is given, $u_t$ is chosen according to $\pi(x_t)$, the state $x_t$ evolves according to $P[\,\cdot \mid x_t, u_t]$ starting at $x_0$, $n_t$ is the number of visits to $(x_t,u_t)$ up to time $t$, and $\{\alpha_n\}_{n \geq 0}$ is a sequence of step sizes satisfying

$$\alpha_n \in [0,1], \quad \sum_n \alpha_n = \infty, \quad \sum_n \alpha_n^2 < \infty.$$
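
As a small illustration of the recursion (6)-(7), the sketch below runs it on a toy two-state, two-action problem and compares the result with the fixed point obtained by iterating the Bellman mapping. The costs, the deterministic transition rule, the uniformly random policy, and the step sizes $\alpha_n = n^{-0.51}$ are all assumptions made for this example.

```python
import random

BETA = 0.5
STATES, ACTIONS = (0, 1), (0, 1)
COST = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}   # hypothetical costs

def step(x, u):
    """Deterministic toy transition: action 1 flips the state, action 0 keeps it."""
    return (x + u) % 2

def fixed_point(iters=200):
    """Iterate Q <- c(x,u) + beta * min_v Q(x',v) until it settles at its fixed point."""
    Q = {k: 0.0 for k in COST}
    for _ in range(iters):
        Q = {(x, u): COST[(x, u)] + BETA * min(Q[(step(x, u), v)] for v in ACTIONS)
             for (x, u) in COST}
    return Q

def q_learning(steps, rng, exponent=0.51):
    """Recursion (6)-(7) under a uniformly random stationary policy."""
    Q = {k: 0.0 for k in COST}
    visits = {k: 0 for k in COST}
    x = 0
    for _ in range(steps):
        u = rng.choice(ACTIONS)
        visits[(x, u)] += 1                            # n_t: visits to (x_t, u_t) so far
        alpha = 1.0 / visits[(x, u)] ** exponent       # alpha_n = n^(-0.51)
        x_next = step(x, u)
        target = COST[(x, u)] + BETA * min(Q[(x_next, v)] for v in ACTIONS)
        Q[(x, u)] += alpha * (target - Q[(x, u)])      # only the visited entry changes
        x = x_next
    return Q
```

The exponent 0.51 makes the step sizes non-summable but square-summable, matching the conditions above.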

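Lemma 1: Assume that each $(x,u)$ is visited infinitely often w.p. 1. For any $\epsilon > 0$ and compact $\mathbb{Q} \subset \mathbb{R}^{|X \times U|}$, there exists $T^{\mathbb{Q}}_\epsilon < \infty$ such that, for any $Q_0 \in \mathbb{Q}$,

$$P\Big[\sup_{t \geq T^{\mathbb{Q}}_\epsilon} |Q_t - \bar Q|_\infty \leq \epsilon\Big] \geq 1 - \epsilon$$

where $|\cdot|_\infty$ denotes the maximum norm and $\bar Q$ is the unique fixed point of the mapping $F : \mathbb{R}^{|X \times U|} \to \mathbb{R}^{|X \times U|}$ defined by

$$F(Q)(x,u) = E[c(x,u)] + \beta \sum_{x'} P[x' \mid x,u] \min_v Q(x',v)$$

for all $x, u$.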

Proof: Let $\{Q'_t\}_{t \geq 0}$ and $\{Q''_t\}_{t \geq 0}$ be the trajectories for the initial conditions $Q'_0$ and $Q''_0$, respectively, corresponding to a sample path $\{(x_t, u_t, c(x_t,u_t))\}_{t \geq 0}$. It is easy to see that, for all $t \geq 0$,

$$|Q'_{t+1}(x_t,u_t) - Q''_{t+1}(x_t,u_t)| \leq (1 - \alpha_{n_t})|Q'_t(x_t,u_t) - Q''_t(x_t,u_t)| + \alpha_{n_t}\beta|Q'_t - Q''_t|_\infty.$$

This implies that $M_t := \sup_{Q'_0, Q''_0 \in \mathbb{Q}} |Q'_t - Q''_t|_\infty$ is non-increasing and therefore convergent. Suppose that $M_t \to \bar M > 0$. There exists some $\bar t < \infty$ such that $\max_{t \geq \bar t} M_t \leq \bar M(1 + 1/\beta)/2$. Hence, we have, for all $t \geq \bar t$,

$$|Q'_{t+1}(x_t,u_t) - Q''_{t+1}(x_t,u_t)| \leq (1 - \alpha_{n_t})|Q'_t(x_t,u_t) - Q''_t(x_t,u_t)| + \alpha_{n_t}\beta\bar M(1 + 1/\beta)/2.$$

This leads to: for all $(x,u)$ and $t \geq \bar t$,

$$|Q'_{t+1}(x,u) - Q''_{t+1}(x,u)| \leq \prod_{s=0}^{m_t(x,u)}(1 - \alpha_s)\, M_0 + \Big[1 - \prod_{s=0}^{m_t(x,u)}(1 - \alpha_s)\Big]\beta\bar M(1 + 1/\beta)/2$$

where $m_t(x,u) := \sum_{s=0}^{t} 1\{(x_s,u_s) = (x,u)\}$ is the number of visits to $(x,u)$ in $[0,t]$. Since each $(x,u)$ is visited infinitely often w.p. 1 and $\sum_s \alpha_s = \infty$, we have, for each $(x,u)$, $\prod_{s=0}^{m_t(x,u)}(1 - \alpha_s) \to 0$ as $t \to \infty$ w.p. 1. This implies that $\bar M \leq \beta\bar M(1 + 1/\beta)/2 < \bar M$ w.p. 1, which is a contradiction. Therefore, $M_t \to 0$, w.p. 1.

Theorem 4 in [?] shows that, for any initial condition $Q_0$, $Q_t \to \bar Q$, w.p. 1. Hence, for any $Q'_0 \in \mathbb{Q}$, we have $|Q'_t - \bar Q|_\infty + \sup_{Q''_0 \in \mathbb{Q}} |Q'_t - Q''_t|_\infty \to 0$, w.p. 1. Therefore, $\sup_{Q''_0 \in \mathbb{Q}} |Q''_t - \bar Q|_\infty \to 0$, w.p. 1. This leads to the desired result, i.e., for any $\epsilon > 0$ and compact $\mathbb{Q} \subset \mathbb{R}^{|X \times U|}$, there exists $T^{\mathbb{Q}}_\epsilon < \infty$ such that

$$P\Big[\sup_{t \geq T^{\mathbb{Q}}_\epsilon} \sup_{Q''_0 \in \mathbb{Q}} |Q''_t - \bar Q|_\infty \leq \epsilon\Big] \geq 1 - \epsilon.$$

■


Remark 2: The Q-factors corresponding to a certain deterministic policy $\pi$ can be learnt by modifying the recursion (6)-(7) as follows: for $t \geq 0$,

$$Q_{t+1}(x,u) = Q_t(x,u), \text{ for all } (x,u) \neq (x_t,u_t)$$

$$Q_{t+1}(x_t,u_t) = Q_t(x_t,u_t) + \alpha_{n_t}\Big(c(x_t,u_t) + \beta Q_t(x_{t+1}, \pi(x_{t+1})) - Q_t(x_t,u_t)\Big)$$

where the initial condition $Q_0$ is given and $u_t$ is chosen according to $\pi(x_t)$. Hence, the uniform convergence result in Lemma 1 also holds for this recursion.


Appendix B
Proof of Theorem 1

For any $\pi^{-i} \in \Delta^{-i}$, let $F^i_{\pi^{-i}}$ denote the self-mapping of $\mathbb{R}^{|X \times U^i|}$ defined by

$$F^i_{\pi^{-i}}(Q^i)(x,u^i) = E_{\pi^{-i}(x)}\Big[c^i(x,u^i,u^{-i}) + \beta^i \sum_{x'} P[x' \mid x,u^i,u^{-i}] \min_{v^i} Q^i(x',v^i)\Big]$$

for all $x, u^i$. It is well-known that $F^i_{\pi^{-i}}$ is a contraction mapping with the Lipschitz constant $\beta^i$ with respect to the maximum norm. Recall from Section II that each DM$^i$'s optimal Q-factors $Q^i_{\pi^{-i}}$ is the unique fixed point of $F^i_{\pi^{-i}}$. We also note that, during the $k$-th exploration phase, each DM$^j$ actually uses the random policy $\bar\pi^j_k$ defined as

$$\bar\pi^j_k = (1 - \rho^j)\pi^j_k + \rho^j\nu^j \qquad (8)$$

where $\nu^j$ is the random policy that assigns the uniform distribution on $U^j$ to each $x$.

Lemma 2: For any $\epsilon > 0$, there exists $T_\epsilon < \infty$ such that, if $T_k \geq T_\epsilon$, then

$$P\Big[|Q^i_{t_{k+1}} - Q^i_{\bar\pi^{-i}_k}|_\infty \leq \epsilon, \text{ for all } i\Big] \geq 1 - \epsilon.$$

Proof: Note that the $k$-th exploration phase starts with $x_{t_k}$, which belongs to the finite state space $X$, and $Q^i_{t_k} \in \mathbb{Q}^i$ where $\mathbb{Q}^i$ is compact, for all $i$. Note also that, during each exploration phase, DMs use stationary random policies of the form (8) and there are finitely many such joint policies. Hence, the desired result follows from Lemma 1 in Appendix A. ■


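Lemma 3: For any $\epsilon > 0$, there exists $\rho_\epsilon > 0$ such that, if $\rho^i \leq \rho_\epsilon$ for all $i$, then

$$|Q^i_{\bar\pi^{-i}_k} - Q^i_{\pi^{-i}_k}|_\infty \leq \epsilon, \text{ for all } i, k.$$

Proof: We have

$$|Q^i_{\bar\pi^{-i}_k} - Q^i_{\pi^{-i}_k}|_\infty = |F^i_{\bar\pi^{-i}_k}(Q^i_{\bar\pi^{-i}_k}) - F^i_{\pi^{-i}_k}(Q^i_{\pi^{-i}_k})|_\infty$$

$$\leq |F^i_{\bar\pi^{-i}_k}(Q^i_{\bar\pi^{-i}_k}) - F^i_{\bar\pi^{-i}_k}(Q^i_{\pi^{-i}_k})|_\infty + |F^i_{\bar\pi^{-i}_k}(Q^i_{\pi^{-i}_k}) - F^i_{\pi^{-i}_k}(Q^i_{\pi^{-i}_k})|_\infty$$

$$\leq \beta^i |Q^i_{\bar\pi^{-i}_k} - Q^i_{\pi^{-i}_k}|_\infty + \Big(1 - \prod_{j \neq i}(1 - \rho^j)\Big) |F^i_{\phi^{-i}_k}(Q^i_{\pi^{-i}_k}) - F^i_{\pi^{-i}_k}(Q^i_{\pi^{-i}_k})|_\infty$$

where $\phi^{-i}_k \in \Delta^{-i}$ is some convex combination of the policies in $\Delta^{-i}$ of the form where each DM$^j$, $j \neq i$, either uses its baseline policy $\pi^j_k \in \Pi^j$ or the uniform distribution $\nu^j$.$^3$ Because $(\pi^{-i}_k, \phi^{-i}_k)$ belongs to a finite subset of $\Pi^{-i} \times \Delta^{-i}$, an upper bound $\bar F < \infty$ on

$$|F^i_{\phi^{-i}_k}(Q^i_{\pi^{-i}_k}) - F^i_{\pi^{-i}_k}(Q^i_{\pi^{-i}_k})|_\infty$$

exists, which is uniform in $(\pi^{-i}_k, \phi^{-i}_k)$. This results in

$$|Q^i_{\bar\pi^{-i}_k} - Q^i_{\pi^{-i}_k}|_\infty \leq \Big(1 - \prod_{j \neq i}(1 - \rho^j)\Big)\frac{\bar F}{1 - \beta^i}$$

which proves the lemma. ■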

Let $\bar\delta$ denote the minimum separation between the entries of DMs' optimal Q-factors (with respect to the deterministic policies), defined as$^4$

$$\bar\delta := \min_{\substack{i,\, x,\, v^i,\, \tilde v^i,\, \pi^{-i} \in \Pi^{-i}: \\ Q^i_{\pi^{-i}}(x,v^i) \neq Q^i_{\pi^{-i}}(x,\tilde v^i)}} |Q^i_{\pi^{-i}}(x,v^i) - Q^i_{\pi^{-i}}(x,\tilde v^i)|.$$

We consider $\bar\delta$ to be an upper bound on the tolerance levels for sub-optimality, i.e., $\delta^i \in (0, \bar\delta)$, for all $i$. In that case, we also introduce an upper bound $\bar\rho > 0$ on the experimentation rates such that, if $\rho^i < \bar\rho$, for all $i$, then

$$|Q^i_{\bar\pi^{-i}_k} - Q^i_{\pi^{-i}_k}|_\infty < \frac{1}{2}\min\{\delta^i, \bar\delta - \delta^i\}, \text{ for all } i, k. \qquad (9)$$

Such an upper bound $\bar\rho > 0$ exists due to Lemma 3.

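Lemma 4: Suppose $\delta^i \in (0, \bar\delta)$, $\rho^i \in (0, \bar\rho)$, for all $i$. For any $\epsilon > 0$, there exists $\tilde T < \infty$ such that, if $T_k \geq \tilde T$, then

$$P[E_k] \geq 1 - \epsilon$$

where $E_k$, $k \geq 0$, is the random event defined as

$$E_k := \Big\{\omega \in \Omega : |Q^i_{t_{k+1}} - Q^i_{\pi^{-i}_k}|_\infty < \frac{1}{2}\min\{\delta^i, \bar\delta - \delta^i\}, \text{ for all } i\Big\}.$$

Proof: The desired result follows from Lemma 2 and (9). ■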


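3 More precisely, $\phi^{-i}_k = \sum_{\emptyset \neq J \subset \{1,\dots,N\} \setminus \{i\}} a_J \phi^{-i,J}_k$ where $a_J := \frac{\prod_{j \in J}\rho^j \prod_{j \notin J \cup \{i\}}(1 - \rho^j)}{1 - \prod_{j \neq i}(1 - \rho^j)}$ and $\phi^{-i,J}_k \in \Delta^{-i}$ is a policy such that $\phi^{j,J}_k = \nu^j$ for $j \in J$ and $\phi^{j,J}_k = \pi^j_k$ for $j \notin J \cup \{i\}$.

4 To avoid trivial cases, we assume $Q^i_{\pi^{-i}}(x,v^i) \neq Q^i_{\pi^{-i}}(x,\tilde v^i)$ for some $i, x, v^i, \tilde v^i, \pi^{-i} \in \Pi^{-i}$.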
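A. Proof of part (i)

Note that

$$\omega \in E_k \implies \Pi_{\pi_k} = \Pi^1_{k+1} \times \cdots \times \Pi^N_{k+1},$$

that is, on $E_k$ the sets of near best replies computed from the learnt Q-factors coincide with the sets of best replies to $\pi_k$. Therefore, we have

$$P[\pi_{k+1} = \pi_k \mid E_k, \pi_k \in \Pi_{eq}] = 1, \text{ for all } k. \qquad (10)$$

Since we have a weakly acyclic game at hand, for each $\pi \in \Pi$, there exists a strict best reply path of minimum length $L_\pi < \infty$ starting at $\pi$ and ending at an equilibrium policy. Let $L := \max_{\pi \in \Pi} L_\pi$. There exists $p_{min} \in (0,1)$ (which depends only on $\lambda^1, \dots, \lambda^N$, and $L$) such that, for all $k$,

$$P[\pi_{k+L} \in \Pi_{eq} \mid E_k, \dots, E_{k+L-1}, \pi_k \notin \Pi_{eq}] \geq p_{min}. \qquad (11)$$

Pick $\hat\epsilon \in (0, \epsilon)$ satisfying

$$\Big(\frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon\Big)(1 - \hat\epsilon) \geq 1 - \epsilon.$$

Lemma 4 implies the existence of $\tilde T < \infty$ such that, if $\min_\ell T_\ell \geq \tilde T$, then

$$P[E_k, \dots, E_{k+L-1}] \geq 1 - \hat\epsilon, \text{ for all } k. \qquad (12)$$

For the rest of this part, we assume $\min_\ell T_\ell \geq \tilde T$. From (10)-(12), we obtain

$$P[\pi_{k+L} \in \Pi_{eq} \mid \pi_k \notin \Pi_{eq}] \geq p_{min}(1 - \hat\epsilon), \text{ for all } k$$

and

$$P[\pi_{k+L} = \cdots = \pi_k \mid \pi_k \in \Pi_{eq}] \geq 1 - \hat\epsilon, \text{ for all } k. \qquad (13)$$

This leads to the recursive inequalities

$$p_{(n+1)L} \geq (1 - \hat\epsilon)[p_{nL} + p_{min}(1 - p_{nL})] \qquad (14)$$

where $p_k := P[\pi_k \in \Pi_{eq}]$, for all $k$. Note that we have, for all $n$,

$$p_{(n+1)L} - p_{nL} \geq -\hat\epsilon. \qquad (15)$$

We rewrite (14) as

$$p_{(n+1)L} - p_{nL} \geq [\hat\epsilon + (1 - \hat\epsilon)p_{min}]\Big(\frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - p_{nL}\Big).$$

This shows that if

$$p_{nL} \leq \frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon \qquad (16)$$

we have $p_{(n+1)L} \geq p_{nL} + p_{min}\hat\epsilon$. Therefore, whenever $p_{nL}$ satisfies (16), it will increase by at least $p_{min}\hat\epsilon$ until it exceeds the right hand side of (16), which will happen in a finite number of steps. In fact, $p_{nL}$ would increase as long as $p_{nL} < \frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}}$. On the other hand, if $p_{nL} \geq \frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}}$, $p_{nL}$ cannot decrease more than $\hat\epsilon$; recall (15). Therefore, there exists $\tilde n < \infty$ such that, for all $n \geq \tilde n$,

$$p_{nL} \geq \frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon.$$

Finally, due to (13), we have, for all $n \geq \tilde n$, $\ell \in \{1, \dots, L-1\}$,

$$p_{nL+\ell} \geq \Big(\frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon\Big)(1 - \hat\epsilon) \geq 1 - \epsilon.$$

B. Proof of part (ii)

For any $\epsilon > 0$, let $\tilde T < \infty$, $\tilde k < \infty$ be as in part (i). Let $\hat k < \infty$ be such that $\min_{\ell \geq \hat k} T_\ell \geq \tilde T$. It is straightforward to see from the proof of part (i) that, for all $k \geq \hat k + \tilde k$, we have

$$P[\pi_k \in \Pi_{eq}] \geq 1 - \epsilon.$$

C. Proof of part (iii)

Pick a sequence $\{\epsilon_n\}_{n \geq 0}$ satisfying $\epsilon_n > 0$, for all $n$, and

$$\sum_{n \geq 0}(1 - p_{min})^{-n}\epsilon_n < \infty \qquad (17)$$

where $p_{min}$ is as in (11). Lemma 4 implies the existence of a sequence $\{\tilde T_n\}_{n \geq 0}$ of finite integers such that if

$$T_{nL}, \dots, T_{(n+1)L-1} \geq \tilde T_n \qquad (18)$$

then

$$P[E_{nL}, \dots, E_{(n+1)L-1}] \geq 1 - \epsilon_n. \qquad (19)$$

We assume (18) (therefore (19)) holds for all $n$. This leads to

$$P[\pi_{(n+1)L} \notin \Pi_{eq}] \leq (1 - p_{min})P[\pi_{nL} \notin \Pi_{eq}] + \epsilon_n.$$

From this, it is straightforward to obtain

$$P[\pi_{(n+1)L} \notin \Pi_{eq}] \leq (1 - p_{min})^n\Big(1 + \sum_{s=0}^{n}(1 - p_{min})^{-s}\epsilon_s\Big).$$

Due to (19), we have, for $\ell \in \{0, \dots, L-1\}$,

$$P[\pi_{nL+\ell} \in \Pi_{eq}] \geq (1 - \epsilon_n)P[\pi_{nL} \in \Pi_{eq}].$$

Therefore, for $\ell \in \{0, \dots, L-1\}$,

$$P[\pi_{(n+1)L+\ell} \notin \Pi_{eq}] \leq (1 - p_{min})^n\Big(1 + \sum_{s=0}^{n}(1 - p_{min})^{-s}\epsilon_s\Big) + \epsilon_{n+1}.$$

From this and (17), we obtain

$$\sum_{k \geq L} P[\pi_k \notin \Pi_{eq}] \leq L\sum_{n \geq 0}\Big[(1 - p_{min})^n\Big(1 + \sum_{s=0}^{n}(1 - p_{min})^{-s}\epsilon_s\Big) + \epsilon_{n+1}\Big] < \infty.$$

Borel-Cantelli Lemma implies

$$P[\pi_k \notin \Pi_{eq}, \text{ for infinitely many } k] = 0. \qquad (20)$$

From (17) and (19), we obtain $\sum_{k \geq 0} P[\Omega \setminus E_k] < \infty$. Borel-Cantelli Lemma again implies

$$P[\Omega \setminus E_k, \text{ for infinitely many } k] = 0. \qquad (21)$$

Finally, (20) and (21) imply the desired result.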
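
To make the role of the recursive inequalities (14) in part (i) concrete, the short sketch below iterates (14) with equality and compares the result with the threshold $(1 - \hat\epsilon)p_{min}/(\hat\epsilon + (1 - \hat\epsilon)p_{min})$ that appears in the proof. The numerical values of $p_{min}$ and $\hat\epsilon$ and the function names are illustrative assumptions.

```python
def eq_probability_lower_bound(p0, p_min, eps_hat, blocks):
    """Iterate (14) with equality: p <- (1 - eps_hat) * (p + p_min * (1 - p)).
    Each iteration corresponds to one block of L exploration phases."""
    p, trajectory = p0, [p0]
    for _ in range(blocks):
        p = (1 - eps_hat) * (p + p_min * (1 - p))
        trajectory.append(p)
    return trajectory

def limit_point(p_min, eps_hat):
    """Fixed point of the map above; it tends to 1 as eps_hat tends to 0."""
    return (1 - eps_hat) * p_min / (eps_hat + (1 - eps_hat) * p_min)
```

For instance, with `p_min = 0.2` and `eps_hat = 0.01` the trajectory started at 0 increases towards `limit_point(0.2, 0.01)`, which is roughly 0.952; smaller values of `eps_hat` (longer exploration phases) push this limit closer to 1.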
Appendix C
Proof of Theorem 2

For any $\pi = (\pi^i, \pi^{-i}) \in \Pi^i \times \Delta^{-i}$, let $F^i_\pi$ denote the self-mapping of $\mathbb{R}^{|X \times U^i|}$ defined by

$$F^i_\pi(Q^i)(x,u^i) = E_{\pi^{-i}(x)}\Big[c^i(x,u^i,u^{-i}) + \beta^i \sum_{x'} P[x' \mid x,u^i,u^{-i}]\, Q^i(x', \pi^i(x'))\Big]$$

for all $x, u^i$. It is well-known that $F^i_\pi$ is a contraction mapping with the Lipschitz constant $\beta^i$ with respect to the maximum norm. Let us denote the unique fixed point of $F^i_\pi$ by $Q^i_\pi$. We also note that, during the $k$-th exploration phase, each DM$^j$ actually uses the random policy $\bar\pi^j_k$ defined as

$$\bar\pi^j_k = (1 - \rho^j)\pi^j_k + \rho^j\nu^j \qquad (22)$$

where $\nu^j$ is the random policy that assigns the uniform distribution on $U^j$ to each $x$.

Lemma 5: For any $\epsilon > 0$, there exists $T_\epsilon < \infty$ such that, if $T_k \geq T_\epsilon$, then

$$P\Big[|Q^i_{t_{k+1}} - Q^i_{(\pi^i_k, \bar\pi^{-i}_k)}|_\infty \leq \epsilon \text{ and } |\hat Q^i_{t_{k+1}} - Q^i_{(\hat\pi^i_k, \bar\pi^{-i}_k)}|_\infty \leq \epsilon, \text{ for all } i\Big] \geq 1 - \epsilon, \text{ for all } k.$$

Proof: Note that each exploration phase starts with $x_{t_k}$, which belongs to a finite state space, and $Q^i_{t_k}, \hat Q^i_{t_k} \in \mathbb{Q}^i$ where $\mathbb{Q}^i$ is compact, for all $i$. Note also that, during each exploration phase, DMs use stationary random policies of the form (22) and there are finitely many such joint policies. Hence, the desired result follows from Lemma 1 in Appendix A; see Remark 2. ■

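Lemma 6: For any $\epsilon > 0$, there exists $\rho_\epsilon > 0$ such that, if $\rho^i \leq \rho_\epsilon$, for all $i$, then

$$|Q^i_{(\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty \leq \epsilon \quad \text{and} \quad |Q^i_{(\hat\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\hat\pi^i_k, \pi^{-i}_k)}|_\infty \leq \epsilon,$$

for all $i, k$.

Proof: We have

$$|Q^i_{(\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty = |F^i_{(\pi^i_k, \bar\pi^{-i}_k)}(Q^i_{(\pi^i_k, \bar\pi^{-i}_k)}) - F^i_{(\pi^i_k, \pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)})|_\infty$$

$$\leq |F^i_{(\pi^i_k, \bar\pi^{-i}_k)}(Q^i_{(\pi^i_k, \bar\pi^{-i}_k)}) - F^i_{(\pi^i_k, \bar\pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)})|_\infty + |F^i_{(\pi^i_k, \bar\pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)}) - F^i_{(\pi^i_k, \pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)})|_\infty$$

$$\leq \beta^i |Q^i_{(\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty + \Big(1 - \prod_{j \neq i}(1 - \rho^j)\Big) |F^i_{(\pi^i_k, \phi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)}) - F^i_{(\pi^i_k, \pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)})|_\infty$$

where $\phi^{-i}_k \in \Delta^{-i}$ is some convex combination of the joint policies of the form where each DM$^j$, $j \neq i$, either uses its baseline policy $\pi^j_k \in \Pi^j$ or the uniform distribution $\nu^j$ (as in Appendix B). Because $(\pi^i_k, \pi^{-i}_k, \phi^{-i}_k)$ belongs to a finite subset of $\Pi^i \times \Pi^{-i} \times \Delta^{-i}$, an upper bound $\bar F < \infty$ on

$$|F^i_{(\pi^i_k, \phi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)}) - F^i_{(\pi^i_k, \pi^{-i}_k)}(Q^i_{(\pi^i_k, \pi^{-i}_k)})|_\infty$$

exists, which is uniform in $(\pi^i_k, \pi^{-i}_k, \phi^{-i}_k)$. This results in

$$|Q^i_{(\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty \leq \Big(1 - \prod_{j \neq i}(1 - \rho^j)\Big)\frac{\bar F}{1 - \beta^i}$$

which leads to the first bound. The second bound can be obtained similarly. ■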

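Let $\bar\delta$ denote the minimum separation between the entries of DMs' Q-factors (for deterministic policies), defined as$^5$

$$\bar\delta := \min\Big\{|Q^i_{(\pi^i, \pi^{-i})}(x, \pi^i(x)) - Q^i_{(\hat\pi^i, \pi^{-i})}(x, \hat\pi^i(x))| : i, x, \pi^i, \hat\pi^i \in \Pi^i, \pi^{-i} \in \Pi^{-i},\ Q^i_{(\pi^i, \pi^{-i})}(x, \pi^i(x)) \neq Q^i_{(\hat\pi^i, \pi^{-i})}(x, \hat\pi^i(x))\Big\}.$$

We consider $\bar\delta$ to be an upper bound on the tolerance levels for sub-optimality, i.e., $\delta^i \in (0, \bar\delta)$, for all $i$. In that case, we also introduce an upper bound $\bar\rho > 0$ on the experimentation rates such that, if $\rho^i < \bar\rho$, for all $i$, then

$$\max\Big\{|Q^i_{(\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty,\ |Q^i_{(\hat\pi^i_k, \bar\pi^{-i}_k)} - Q^i_{(\hat\pi^i_k, \pi^{-i}_k)}|_\infty\Big\} < \frac{1}{2}\min\{\delta^i, \bar\delta - \delta^i\} \qquad (23)$$

for all $i, k$. Such an upper bound $\bar\rho > 0$ exists due to Lemma 6.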
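Lemma 7: Suppose $0 < \delta^i < \bar\delta$, $0 < \rho^i < \bar\rho$, for all $i$. For any $\epsilon > 0$, there exists $\tilde T < \infty$ such that, if $T_k \geq \tilde T$, then

$$P[E_k] \geq 1 - \epsilon$$

where $E_k$, $k \geq 0$, is the random event defined as

$$E_k := \Big\{\omega \in \Omega : \max\Big\{|Q^i_{t_{k+1}} - Q^i_{(\pi^i_k, \pi^{-i}_k)}|_\infty,\ |\hat Q^i_{t_{k+1}} - Q^i_{(\hat\pi^i_k, \pi^{-i}_k)}|_\infty\Big\} < \frac{1}{2}\min\{\delta^i, \bar\delta - \delta^i\}, \text{ for all } i\Big\}.$$

Proof: The desired result follows from Lemma 5 and (23). ■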

5 We assume $Q^i_{(\pi^i, \pi^{-i})}(x, \pi^i(x)) \neq Q^i_{(\hat\pi^i, \pi^{-i})}(x, \hat\pi^i(x))$, for some $i$, $x$, $\pi^i, \hat\pi^i \in \Pi^i$, $\pi^{-i} \in \Pi^{-i}$, to avoid trivial cases.

We have

$$P[\pi_{k+1} = \pi_k \mid E_k, \pi_k \in \Pi_{eq}] = 1, \text{ for all } k. \qquad (24)$$

Since we have a weakly acyclic game at hand, for each $\pi \in \Pi$, there exists a strict better reply path of minimum length $L_\pi < \infty$ starting at $\pi$ and ending at an equilibrium policy. Let $L := \max_{\pi \in \Pi} L_\pi$. There exists $p_{min} \in (0,1)$ (which depends only on $\lambda^1, \dots, \lambda^N$, and $L$) such that, for all $k$,

$$P[\pi_{k+L} \in \Pi_{eq} \mid E_k, \dots, E_{k+L-1}, \pi_k \notin \Pi_{eq}] \geq p_{min}. \qquad (25)$$

Pick $\hat\epsilon \in (0, \epsilon)$ satisfying

$$\Big(\frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon\Big)(1 - \hat\epsilon) \geq 1 - \epsilon.$$

Lemma 7 implies the existence of $\tilde T < \infty$ such that, if $\min_\ell T_\ell \geq \tilde T$, then

$$P[E_k, \dots, E_{k+L-1}] \geq 1 - \hat\epsilon, \text{ for all } k. \qquad (26)$$

For the rest of the proof, we assume $\min_\ell T_\ell \geq \tilde T$. From (24), (25), (26), we obtain, for all $k$,

$$P[\pi_{k+L} \in \Pi_{eq} \mid \pi_k \notin \Pi_{eq}] \geq p_{min}(1 - \hat\epsilon)$$

and

$$P[\pi_{k+L} = \cdots = \pi_k \mid \pi_k \in \Pi_{eq}] \geq 1 - \hat\epsilon.$$

This leads to the recursive inequalities

$$p_{(n+1)L} \geq (1 - \hat\epsilon)[p_{nL} + p_{min}(1 - p_{nL})] \qquad (27)$$

where $p_k := P[\pi_k \in \Pi_{eq}]$. Note that these inequalities are similar to (14) and, by similar reasoning, there exists $\tilde n < \infty$ such that, for all $n \geq \tilde n$ and $\ell \in \{1, \dots, L-1\}$,

$$p_{nL+\ell} \geq \Big(\frac{(1 - \hat\epsilon)p_{min}}{\hat\epsilon + (1 - \hat\epsilon)p_{min}} - \hat\epsilon\Big)(1 - \hat\epsilon) \geq 1 - \epsilon.$$

This proves part (i). The proofs of parts (ii)-(iii) are analogous to the proofs of parts (ii)-(iii) of Theorem 1, respectively.